REVISITING MARGINAL REGRESSION 



By Christopher R. Genovese, Jiashun Jin and Larry Wasserman

Carnegie Mellon University 
November 20, 2009 

The lasso has become an important practical tool for high dimen- 
sional regression as well as the object of intense theoretical investiga- 
tion. But despite the availability of efficient algorithms, the lasso re- 
mains computationally demanding in regression problems where the 
number of variables vastly exceeds the number of data points. A much 
older method, marginal regression, largely displaced by the lasso, of- 
fers a promising alternative in this case. Computation for marginal 
regression is practical even when the dimension is very high. In this 
paper, we study the relative performance of the lasso and marginal 
regression for regression problems in three different regimes: (a) ex- 
act reconstruction in the noise-free and noisy cases when design and 
coefficients are fixed, (b) exact reconstruction in the noise-free case 
when the design is fixed but the coefficients are random, and (c) re- 
construction in the noisy case where performance is measured by the 
number of coefficients whose sign is incorrect. 

In the first regime, we compare the conditions for exact recon- 
struction of the two procedures, find examples where each procedure 
succeeds while the other fails, and characterize the advantages and 
disadvantages of each. In the second regime, we derive conditions un- 
der which marginal regression will provide exact reconstruction with 
high probability. And in the third regime, we derive rates of conver- 
gence for the procedures and offer a new partitioning of the "phase 
diagram," that shows when exact or Hamming reconstruction is ef- 
fective. 

In addition to theoretical investigation, we present simulations 
showing that in practice, marginal regression and the lasso can have 
comparable performance, while the computational advantages of marginal 
regression make it feasible for much larger problems. 

1. Introduction. A central theme in recent work on regression is that
sparsity plays a critical role in effective high-dimensional inference. Consider
a regression model,

(1) $Y = X\beta + z$,

with response $Y = (Y_1, \ldots, Y_n)^T$, $n \times p$ design matrix $X$, coefficients $\beta = (\beta_1, \ldots, \beta_p)^T$, and noise variables $z = (z_1, \ldots, z_n)^T$. Loosely speaking, this
model is high-dimensional when $p \gg n$ and is sparse when many components
of $\beta$ equal zero.

An important problem in this context is variable selection: determining 
which components of $\beta$ are non-zero. For general $\beta$, the problem is
underdetermined, but recent results have demonstrated that under particular
conditions on $X$, to be discussed below, sufficient sparsity of $\beta$ allows (i) exact
reconstruction of $\beta$ in the noise-free case [28] and (ii) consistent selection
of the non-zero coefficients in the noisy case [31 HI El El [TBI HZ1 H9J [211 [281
[501 E21 S3]. Many of these results are based on showing that under sparsity
constraints, a convex optimization problem that controls the $\ell_1$ norm
of the coefficients has the same solution as an (intractable) combinatorial 
optimization problem that controls the number of non-zero coefficients. 

In practice, the lasso [5], [26] has become one of the main tools for sparse
high-dimensional variable selection, due both to its computational simplicity
and its direct connection to these theoretical results. The lasso estimator in
the regression problem is defined by

(2) $\hat\beta_{\mathrm{lasso}} = \operatorname{argmin}_{\beta} \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$,

where $\|\beta\|_1 = \sum_j |\beta_j|$ and $\lambda > 0$ is a regularization parameter that must be
specified. The lasso gives rise to a convex optimization problem and thus is 
computationally tractable even for moderately large problems. Indeed, the 
LARS algorithm [H] can compute the entire solution path as a function of
$\lambda$ in $O(p^3 + np^2)$ operations. Gradient descent algorithms for the lasso are
faster in practice, but have the same computational complexity. For very 
large p, the lasso remains computationally demanding. 

A much older and computationally simpler method for variable selection 
is marginal regression (also called correlation learning, simple thresholding 
[B], and sure screening [IB]), in which the outcome variable is regressed on 
each covariate separately. To compute the marginal regression estimates for 
variable selection, we begin by computing the marginal regression coefficients 
which, assuming $X$ has been standardized, are

(3) $\hat\alpha = X^T Y$.

Then, we threshold $\hat\alpha$ using the tuning parameter $t > 0$:

(4) $\hat\beta_j = \hat\alpha_j 1\{|\hat\alpha_j| > t\}$.

This requires $O(np)$ operations, two orders faster than the lasso for $p \gg n$,
and so is tractable for much larger problems.
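
The two steps in (3) and (4) amount to one matrix-vector product followed by a componentwise threshold. The following is a minimal sketch in plain Python, assuming the columns of X have already been standardized; the function name `marginal_regression` and the toy data are illustrative and not part of the paper.

```python
def marginal_regression(X, Y, t):
    """Return (alpha_hat, beta_hat) for an n x p design X given as a list of rows."""
    n, p = len(X), len(X[0])
    # alpha_hat = X^T Y: one inner product per column, O(np) work in total.
    alpha_hat = [sum(X[i][j] * Y[i] for i in range(n)) for j in range(p)]
    # beta_hat_j = alpha_hat_j * 1{|alpha_hat_j| > t}
    beta_hat = [a if abs(a) > t else 0.0 for a in alpha_hat]
    return alpha_hat, beta_hat

# Toy example (hypothetical data): orthonormal columns, noise-free Y = X beta.
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
beta = [2.0, 0.0, -1.5]
Y = [sum(X[i][j] * beta[j] for j in range(3)) for i in range(3)]
alpha_hat, beta_hat = marginal_regression(X, Y, t=0.5)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
```

With an orthonormal design the marginal coefficients coincide with the true coefficients, so any threshold below the smallest non-zero magnitude selects the right variables; correlated columns are where the conditions discussed later come into play.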

The lasso has mostly displaced marginal regression in practice. But the 
computational advantage for large problems prompts a second look. Tibshirani
and Witten [27] have found that marginal regression sometimes
outperforms the lasso in predictive error. Here we revisit marginal regression
as a tool for variable selection and ask whether there is any strong
reason to prefer the lasso. If marginal regression exhibits comparable per- 
formance, theoretically and empirically, then it offers a plausible alternative 
to the lasso. Put another way: because of its simplicity, marginal regression 
only needs to tie to win. 

In this paper, we study the relative performance of the lasso and marginal 
regression in three different regimes. In Section 2, we compare the conditions
that guarantee exact variable selection in the noise-free case and briefly
compare the conditions for consistent variable selection (i.e. sparsistency)
in the noisy case. The two sets of conditions are generally overlapping, and
we give examples where each procedure fails while the other succeeds. One
advantage of the lasso is that, given a fixed matrix $X$, the conditions for
its success hold over a larger class of $\beta$'s than that of marginal regression.
On the other hand, marginal regression has a larger tolerance for collinearity
than does the lasso and is somewhat easier to tune, as we illustrate in Section
2.3.

In Section 3, we consider the regime where the design matrix $X$ is fixed
but the coefficient vector $\beta$ is randomly generated. We find conditions such
that marginal regression performs well with overwhelming probability. The 
main condition, which we call faithfulness, is closely related to both the 
Faithfulness Condition of [21] and the Incoherence Condition of [8]. The 
Incoherence Condition depends only on X and is thus checkable in practice, 
but it aims to control the worst case so is quite conservative. The Faithfulness 
Condition of [21] is relatively less stringent but depends on the unknown
support of the parameter vector. Our version of the Faithfulness Condition 
strikes a compromise between the two. 

Although exact variable selection has been the focus of many studies in 
the literature, it is rare in practice to select exactly the right variables, so it 
is natural to measure performance in terms of the deviation from exact
selection. In Section 4, we study the convergence rates of the two procedures in
Hamming distance between $\mathrm{sgn}(\hat\beta)$ and $\mathrm{sgn}(\beta)$. Our main result in this
section is a new partition of the parameter space into three regions I-III. In the
interior of region I, exact variable selection is possible (asymptotically), and 
both procedures achieve this given properly chosen tuning parameters. In 
region II, it is possible to have a variable selection procedure that recovers 
most relevant variables, but not all of them. And in Region III, success- 
ful variable selection is impossible, and the optimal Hamming distance is 
asymptotically equivalent to the total number of relevant variables. 

Finally, in Section 5, we present simulation studies showing that marginal
regression and the lasso perform comparably over a range of parameters.
Section 6 gives the proofs of all theorems and lemmas in the order they
appear. 

Notation. For a real number $x$, let $\mathrm{sgn}(x)$ be $-1$, $0$, or $1$ when $x < 0$, $x = 0$, and $x > 0$; and for a vector $u \in \mathbb{R}^k$, define $\mathrm{sgn}(u) = (\mathrm{sgn}(u_1), \ldots, \mathrm{sgn}(u_k))^T$. We will use $\|\cdot\|$, with various subscripts, to denote vector and matrix norms, and $|\cdot|$ to represent absolute value, applied component-wise when applied to vectors. With some abuse of notation, we will write $\min u$ ($\min |u|$) to denote the minimum (absolute) component of a vector $u$. Inequalities between vectors are to be understood component-wise as well.

2. Noise-Free Conditions for Exact Variable Selection. Consider a sequence of regression problems with deterministic design matrices, indexed by sample size $n$,

(5) $Y^{(n)} = X^{(n)}\beta^{(n)} + z^{(n)}$.

Here, $Y^{(n)}$ and $z^{(n)}$ are $n \times 1$ response and noise vectors, respectively, $X^{(n)}$ is an $n \times p^{(n)}$ matrix and $\beta^{(n)}$ is a $p^{(n)} \times 1$ vector, where we typically assume $p^{(n)} \gg n$. We assume that $\beta^{(n)}$ is sparse in the sense that it has $s^{(n)}$ nonzero components where $s^{(n)} \ll p^{(n)}$. By rearranging $\beta^{(n)}$ without loss of generality, we can partition each $X^{(n)}$ and $\beta^{(n)}$ into "signal" and "noise" pieces, corresponding to the non-zero or zero coefficients, as follows:

(6) $X^{(n)} = (X_S^{(n)}, X_N^{(n)})$, $\beta^{(n)} = \begin{pmatrix} \beta_S^{(n)} \\ \beta_N^{(n)} \end{pmatrix}$.

In fact, we assume that $\beta_S^{(n)} \in \mathcal{M}^{s^{(n)}}_{\rho^{(n)}}$ for a sequence $\rho^{(n)} > 0$ (and not converging to zero too quickly) with

(7) $\mathcal{M}^k_a = \{x = (x_1, \ldots, x_k)^T \in \mathbb{R}^k : |x_j| \geq a \text{ for all } 1 \leq j \leq k\}$,

for positive integer $k$ and $a > 0$. This commonly used condition on $\beta^{(n)}$ ensures that the non-zero components are not so close to zero as to be indistinguishable from it. Finally, define the Gram matrix $C^{(n)} = (X^{(n)})^T X^{(n)}$ and partition this as

(8) $C^{(n)} = \begin{pmatrix} C_{SS}^{(n)} & C_{SN}^{(n)} \\ C_{NS}^{(n)} & C_{NN}^{(n)} \end{pmatrix}$,

where of course $C_{SN}^{(n)} = (C_{NS}^{(n)})^T$. Except in Sections 4-5 we suppose $X^{(n)}$ is normalized so that all diagonal coordinates of $C^{(n)}$ are 1.

These $^{(n)}$ superscripts become tedious, so for the remainder of the paper, we suppress them unless necessary to show variation in $n$. The quantities $X$, $C$, $p$, $s$, $\rho$, as well as the tuning parameters $\lambda$ (for the lasso; see (2)) and $t$ (for marginal regression; see (4)) are all thus implicitly dependent on $n$. We use $\mathcal{M}_\rho$ to denote the space $\mathcal{M}^{s^{(n)}}_{\rho^{(n)}}$.

We will begin by specifying conditions on $C$, $\rho$, $\lambda$, and $t$ such that in the noise-free case, exact reconstruction of $\beta$ is possible for the lasso or marginal regression, for all $\beta_S \in \mathcal{M}_\rho$. These in turn lead to conditions on $C^{(n)}$, $p^{(n)}$, $s^{(n)}$, $\rho^{(n)}$, $\lambda^{(n)}$, and $t^{(n)}$ such that in the case of homoscedastic Gaussian noise, the non-zero coefficients can be selected consistently, meaning that for all sequences $\beta_S^{(n)} \in \mathcal{M}^{s^{(n)}}_{\rho^{(n)}} \equiv \mathcal{M}_\rho$,



(9) $P\left(\mathrm{sgn}(|\hat\beta^{(n)}|) = \mathrm{sgn}(|\beta^{(n)}|)\right) \to 1$,

as $n \to \infty$. (This property was dubbed sparsistency by Pradeep Ravikumar
[23].) Our goal is to compare these conditions. In this section, we focus on 
the noise-free case, and keep the discussion on the noise case brief. 



2.1. Exact reconstruction conditions for the lasso in the noise-free case. 
We begin by considering three conditions in the noise-free case that are now 
standard in the literature on the lasso: 

Condition E. The minimum eigenvalue of $C_{SS}$ is positive.

Condition I. (Irrepresentableness)

$\max |C_{NS} C_{SS}^{-1} \mathrm{sgn}(\beta_S)| \leq 1$.

Condition J.

$\min |\beta_S - \lambda C_{SS}^{-1} \mathrm{sgn}(\beta_S)| > 0$.

Because $C_{SS}$ is symmetric and non-negative definite, Condition E is equivalent to $C_{SS}$ being invertible. Later we will strengthen this condition. A critical feature of Condition I is that it only depends on the sign pattern, as we will see.

For the noise-free case, Wainwright [30, Lemma 1] shows that assuming Condition E, Conditions I and J are necessary and sufficient for the existence of a lasso solution $\hat\beta$ with tuning parameter $\lambda$ such that

$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta)$.

(See also [32].) Note that this result is stronger than correctly selecting the non-zero coefficients, as it gets the signs correct as well.

Maximizing the left-hand side of Condition I considers all $2^s$ sign patterns and gives $\|C_{NS} C_{SS}^{-1}\|_\infty$, the maximum-absolute-row-sum matrix norm. It follows that Condition I holds for all $\beta_S \in \mathcal{M}_\rho$ if and only if $\|C_{NS} C_{SS}^{-1}\|_\infty \leq 1$. Similarly, one way to ensure that Condition J holds over $\mathcal{M}_\rho$ is to require that every component of $\lambda C_{SS}^{-1} \mathrm{sgn}(\beta_S)$ be less than $\rho$. The maximum component of this vector over $\mathcal{M}_\rho$ equals $\lambda\|C_{SS}^{-1}\|_\infty$, which must be less than $\rho$. A simpler relation, in terms of the smallest eigenvalue of $C_{SS}$, is

(10) $\frac{\sqrt{s}}{\mathrm{eigen}_{\min}(C_{SS})} = \sqrt{s}\,\|C_{SS}^{-1}\|_2 \geq \|C_{SS}^{-1}\|_\infty \geq \|C_{SS}^{-1}\|_2 = \frac{1}{\mathrm{eigen}_{\min}(C_{SS})}$,

where the inequalities follow from the symmetry of $C_{SS}$ and standard norm inequalities.

Stronger versions of the above conditions will be useful. 

Condition E'. The minimum eigenvalue of $C_{SS}$ is no less than $\lambda_0 > 0$, where $\lambda_0$ does not depend on $n$.

Condition I'.

$\|C_{NS} C_{SS}^{-1}\|_\infty \leq 1 - \eta$

for $0 < \eta < 1$ small and independent of $n$.

Condition J'.

$\lambda < \rho / \|C_{SS}^{-1}\|_\infty$.

Note that under Condition E', Condition J' can be replaced by the stronger condition $\lambda < \rho\lambda_0/\sqrt{s}$.

Theorem 1. In the noise-free case, Conditions E' (or E), I' (or I), and J' imply that for all $\beta_S \in \mathcal{M}_\rho$, there exists a lasso solution $\hat\beta$ with $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta)$.

The conditions for exact reconstruction can be weakened. For instance, Conditions E, I, and J' are also sufficient for exact reconstruction. But we chose these forms because they transition nicely to the noisy case. In a later section, we will discuss Wainwright's result [30] showing that a slight extension of Conditions E', I', and J' gives sparsistency in the case of homoscedastic Gaussian noise.

2.2. Exact reconstruction conditions for marginal regression in the noise-free case. As above, define $\hat\alpha = X^T Y$ and define $\hat\beta$ by $\hat\beta_j = \hat\alpha_j 1\{|\hat\alpha_j| > t\}$, $1 \leq j \leq p$. For exact reconstruction with marginal regression, we require that $\hat\beta_j \neq 0$ whenever $\beta_j \neq 0$, or equivalently $|\hat\alpha_j| > t$ whenever $\beta_j \neq 0$. In the literature on causal inference, this assumption is called faithfulness [25] and is also used in [TJ [16]. The faithfulness assumption has received much criticism [24]. The usual justification for faithfulness assumptions is that if $\beta$ is selected at random from some distribution, then faithfulness holds with high probability. The criticism in [21] is that results which hold under faithfulness cannot hold in any uniform sense.



We write

$\hat\alpha = X^T Y = \begin{pmatrix} \hat\alpha_S \\ \hat\alpha_N \end{pmatrix}$.

By elementary algebra, since $Y = X_S\beta_S$ in the noise-free case,

$\hat\alpha_S = C_{SS}\beta_S$, $\hat\alpha_N = C_{NS}\beta_S$.

It follows directly that

Condition F. (Faithfulness)

(11) $\max|C_{NS}\beta_S| < \min|C_{SS}\beta_S|$

is required to correctly identify the non-zero coefficients. We call this the Faithfulness Condition even though it is technically different from the standard definition of faithfulness above. We thus have:

Lemma 1. Condition F is necessary and sufficient for exact reconstruction with marginal regression.



Unfortunately, as the next theorem shows, Condition F cannot hold for all $\beta_S \in \mathcal{M}_\rho$. Applying the theorem to $C_{SS}$ shows that for any $\rho > 0$, there exists a $\beta_S \in \mathcal{M}_\rho$ that violates equation (11).

Theorem 2. Let $C$ be an $s \times s$ positive definite, symmetric matrix that is not diagonal. Then for any $\rho > 0$, there exists a $\beta \in \mathcal{M}^s_\rho$ such that $\min|C\beta| = 0$.

Despite the seeming pessimism of Theorem 2, we need to be cautious about over-interpreting this result. Since $C\beta = (Y, X)$, what Theorem 2 says is that, if we fix $X$ and let $Y = X\beta$ range through all possible $\beta \in \mathcal{M}^s_\rho$, then there exists a $Y$ such that $\min|(Y, X)| = 0$. However, both $X$ and $Y$ are observed, and if $\min|(Y, X)| > 0$, one can rule out the result of Theorem 2. Although the situation described in Theorem 2 is sufficient but not necessary for failure of marginal regression, this mitigates the pessimism of the result.
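
For intuition, the $s = 2$ case of Theorem 2 can be worked out by hand. The symbols $r$ (the off-diagonal entry of $C$) and $k$ are introduced here for illustration only and do not appear in the paper.

```latex
C = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \quad 0 < |r| < 1, \qquad
\beta = \begin{pmatrix} k \\ -rk \end{pmatrix}, \quad k = \rho / |r| .
% Both coordinates are at least rho in magnitude:
% |beta_1| = rho / |r| >= rho and |beta_2| = |r| k = rho.
C\beta = \begin{pmatrix} k - r^2 k \\ rk - rk \end{pmatrix}
       = \begin{pmatrix} k(1 - r^2) \\ 0 \end{pmatrix}
\quad\Longrightarrow\quad \min |C\beta| = 0 .
```

The second marginal coefficient vanishes even though $\beta_2 \neq 0$, so no threshold $t > 0$ can retain that variable.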

2.3. Comparison of the exact reconstruction conditions in the noise-free 
case. In this section, we compare conditions for exact reconstruction in the 
noise-free case required for the lasso and marginal regression. We will see
that at the level of individual $\beta$'s, the conditions are generally overlapping
and very closely related. Although the conditions for marginal regression do
not hold uniformly over any $\mathcal{M}_\rho$, they have the advantage that they do not
require invertibility of $C_{SS}$ and hence are less sensitive to small eigenvalues.

We illustrate with a few examples, each in a subsection. 

2.3.1. For an individual $\beta$, the condition for the lasso and that for marginal regression are generally overlapping. Consider an example where

$C_{SS} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $\beta_S = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$.

(In this example $\rho$ denotes the off-diagonal entry of $C_{SS}$.) We investigate, as $\rho$ ranges, which of the two conditions is weaker than the other. Note that $s = 2$ so the matrix $C_{NS}$ only has two columns. Fix a row of $C_{NS}$, say $a = (a_1, a_2)$. Condition I requires

(12) $|a_1 + a_2| \leq 1 + \rho$,

and Condition F requires

(13) $|2a_1 + a_2| < \min\{|2 + \rho|, |1 + 2\rho|\}$.

For many choices of $\rho$, the two conditions (12) and (13) overlap. Take $\rho = -0.75$ for illustration. In Figure 1 we display the regions where $(a_1, a_2)$ satisfy (12) and (13), respectively. The figure shows that the two regions overlap. As a result, the condition for the lasso overlaps with that for marginal regression. For different choices of $\rho$ and $\beta_S$, those regions in Figure 1 may vary, but to a large extent, the two conditions continue to overlap with each other.
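
The overlap can be probed numerically. The sketch below assumes $C_{SS}$ has unit diagonal and off-diagonal entry `rho`, and $\beta_S = (2, 1)^T$, which is what (12) and (13) imply; the right-hand side of (13) is taken as $\min|C_{SS}\beta_S|$ with absolute values, following Condition F. The function names and test points are illustrative.

```python
def lasso_condition(a1, a2, rho):
    # (12): |a1 + a2| <= 1 + rho, i.e. |a C_SS^{-1} sgn(beta_S)| <= 1
    return abs(a1 + a2) <= 1 + rho

def marginal_condition(a1, a2, rho):
    # (13): |2 a1 + a2| < min |C_SS beta_S|, where C_SS beta_S = (2 + rho, 1 + 2 rho)
    return abs(2 * a1 + a2) < min(abs(2 + rho), abs(1 + 2 * rho))

rho = -0.75
points = {"both": (0.1, 0.1), "lasso_only": (0.5, -0.4), "mr_only": (-0.3, 0.7)}
results = {name: (lasso_condition(a1, a2, rho), marginal_condition(a1, a2, rho))
           for name, (a1, a2) in points.items()}
```

Each of the three test rows falls in a different part of the picture: one satisfies both conditions, one only the lasso condition, and one only the marginal regression condition, which is the sense in which the two regions overlap without either containing the other.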

Examples for larger $s$ can be constructed by letting $C_{SS}$ be a block diagonal matrix, where the size of each main diagonal block is small. For each row of $C_{NS}$, the conditions for the lasso and marginal regression are similar to those in (12) and (13), respectively, but may be more complicated. To save space, we omit further discussion along this line.

2.3.2. In the special case of $\beta_S \propto 1_S$, the condition for the lasso and the condition for marginal regression are closely related. Consider the special case $\beta_S \propto 1_S$. In this case, the main condition for the lasso (Condition I) is

(14) $|C_{NS} \cdot C_{SS}^{-1} \cdot 1_S| \leq 1_S$,

and the condition for marginal regression (Condition F) is

(15) $|C_{NS} \cdot 1_S| < |C_{SS} \cdot 1_S|$,

where both inequalities should be interpreted component-wise. The two conditions are surprisingly similar: removing $C_{SS}^{-1}$ on the left side of (14) and adding $C_{SS}$ to the right hand side of it gives (15).

Note that if in addition $1_S$ is an eigenvector of $C_{SS}$, then the two conditions are equivalent to each other. This includes but is not limited to the case of $s = 2$.
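
The equivalence can be checked in one line. Suppose $C_{SS} 1_S = \mu 1_S$ for some eigenvalue $\mu > 0$ (the symbol $\mu$ is introduced for this sketch only). Then:

```latex
C_{SS}^{-1} 1_S = \mu^{-1} 1_S
\;\Longrightarrow\;
\text{lasso: } |C_{NS} 1_S| \le \mu \cdot 1,
\qquad
\text{marginal regression: } |C_{NS} 1_S| < |C_{SS} 1_S| = \mu \cdot 1 .
```

Both conditions bound each component of $|C_{NS} 1_S|$ by the same constant $\mu$ and differ only in whether equality is allowed. For $s = 2$ with unit diagonal and off-diagonal entry $r$, $C_{SS} 1_S = (1 + r) 1_S$, so $\mu = 1 + r$.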

2.3.3. In the special case of $C_{SS} = I$, the condition for the lasso is weaker. Fix $n$ and consider the special case in which $C_{SS} = I$. For the lasso, Condition E' (and thus E) is satisfied, Condition J' reduces to $\lambda < \rho$, and Condition I becomes $\|C_{NS}\|_\infty \leq 1$. Under these conditions, the lasso gives exact reconstruction, but Condition F can fail. To see how, let $\tilde\beta \in \{-1, 1\}^s$ be the vector such that $\max|C_{NS}\tilde\beta| = \|C_{NS}\|_\infty$ and let $\ell$ be the index of the row at which the maximum is attained, choosing the row with the biggest absolute element if the maximum is not unique. Let $u$ be the maximum absolute element of row $\ell$ of $C_{NS}$, with index $j$. Define a vector $\delta$ to be zero except in component $j$, which has the value $\tilde\beta_j / (u\|C_{NS}\|_\infty)$. Let $\beta = \rho\tilde\beta + \rho\delta$. Then,

(16) $|(C_{NS}\beta)_\ell| = \rho\,|(C_{NS}\tilde\beta)_\ell| + \rho \cdot \frac{1}{\|C_{NS}\|_\infty}$

(17) $= \rho\left(\|C_{NS}\|_\infty + \frac{1}{\|C_{NS}\|_\infty}\right)$

(18) $> \rho$.

[Figure 1: panel titled $\rho = -0.75$; axis ticks from $-1$ to $1$.]

Fig 1. Let $C_{SS}$ and $\beta_S$ be as in Section 2.3.1, where $\rho = -0.75$. For a row of $C_{NS}$, say $(a_1, a_2)$, the interior of the red box and that of the green box are the regions of $(a_1, a_2)$ satisfying the condition for the lasso (i.e. (12)) and for marginal regression (i.e. (13)), respectively.



It follows that $\max|C_{NS}\beta| > \rho = \min|\beta|$, so Condition F fails.
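
The construction behind (16)-(18) is easy to carry out numerically. The sketch below assumes $C_{SS} = I$ and builds $\beta = \rho\tilde\beta + \rho\delta$ from a given $C_{NS}$; the function name, the example matrix, and the simplified tie-breaking (first maximizing row) are assumptions of this illustration.

```python
def condition_f_counterexample(C_NS, rho):
    """Build beta = rho*beta_tilde + rho*delta as in (16)-(18), assuming C_SS = I."""
    # Row l with the largest absolute row sum attains ||C_NS||_inf
    # (ties broken by first index, a simplification of the rule in the text).
    row_sums = [sum(abs(c) for c in row) for row in C_NS]
    norm_inf = max(row_sums)
    l = row_sums.index(norm_inf)
    row = C_NS[l]
    # beta_tilde matches the signs of row l, so (C_NS beta_tilde)_l = ||C_NS||_inf.
    beta_tilde = [1.0 if c >= 0 else -1.0 for c in row]
    # j indexes the largest absolute entry u of row l.
    j = max(range(len(row)), key=lambda k: abs(row[k]))
    u = abs(row[j])
    beta = [rho * b for b in beta_tilde]
    beta[j] += rho * beta_tilde[j] / (u * norm_inf)   # the rho * delta term
    return l, norm_inf, beta

C_NS = [[0.5, 0.25], [0.1, -0.2]]   # hypothetical, with ||C_NS||_inf = 0.75 <= 1
rho = 1.0
l, norm_inf, beta = condition_f_counterexample(C_NS, rho)
lhs = max(abs(sum(c * b for c, b in zip(row, beta))) for row in C_NS)  # max |C_NS beta|
rhs = min(abs(b) for b in beta)                                       # min |C_SS beta|, C_SS = I
```

Here the lasso condition $\|C_{NS}\|_\infty \leq 1$ holds, yet `lhs` exceeds `rhs`, so this particular $\beta$ violates Condition F.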

On the other hand, suppose Condition F holds for all $\beta_S \in \{-1, 1\}^s$. (It cannot hold for all $\mathcal{M}_\rho$ by Theorem 2.) Then, for all $\beta_S \in \{-1, 1\}^s$, $\max|C_{NS}\beta_S| < 1$, which implies that $\|C_{NS}\|_\infty < 1$. Choosing $\lambda < \rho$, we have Conditions E', I, and J' satisfied, showing by Theorem 1 that the lasso gives exact reconstruction.

It follows that the conditions for the lasso are weaker in this case. 

2.3.4. Small eigenvalues of $(X_S' X_S)$ may have an adverse effect on the performance of the lasso, but not always on that of marginal regression. For simplicity, assume

$\beta_S \propto 1_S$.

(We remark that the phenomenon to be described below is not limited to the case of $\beta_S \propto 1_S$.) For $1 \leq i \leq s$, let $\lambda_i$ and $\xi_i$ be the $i$-th eigenvalue and eigenvector of $C_{SS}$. Without loss of generality, we assume that the $\xi_i$ have unit $\ell_2$ norm. By elementary algebra, there are constants $c_1, \ldots, c_s$ such that $1_S = c_1\xi_1 + c_2\xi_2 + \cdots + c_s\xi_s$. It follows that

$C_{SS}^{-1} \cdot 1_S = \sum_{i=1}^s \frac{c_i}{\lambda_i}\xi_i$ and $C_{SS} \cdot 1_S = \sum_{i=1}^s (c_i\lambda_i)\xi_i$.

Fix a row of $C_{NS}$, say, $a = (a_1, \ldots, a_s)$. Respectively, the conditions for the lasso and marginal regression require

(19) $\left|\left(a, \sum_{i=1}^s \frac{c_i}{\lambda_i}\xi_i\right)\right| \leq 1$ and $|(a, 1_S)| < \left|\sum_{i=1}^s (c_i\lambda_i)\xi_i\right|$.

Without loss of generality, we assume that $\lambda_1$ is the smallest eigenvalue of $C_{SS}$. Consider the case where $\lambda_1$ is small, while all other eigenvalues have a magnitude comparable to 1. In this case, the smallness of $\lambda_1$ has a negligible effect on $\sum_{i=1}^s (c_i\lambda_i)\xi_i$, and so has a negligible effect on the condition for marginal regression. However, the smallness of $\lambda_1$ may have an adverse effect on the performance of the lasso. To see the point, we note that $\sum_{i=1}^s \frac{c_i}{\lambda_i}\xi_i \approx \frac{c_1}{\lambda_1}\xi_1$. Compare this with the first term in (19). The condition for the lasso is roughly

$|(a, \xi_1)| \leq \lambda_1$,

which is rather restrictive since $\lambda_1$ is small.

We now further illustrate the point with an example. Let

$C_{SS} = \begin{pmatrix} 1 & -1/2 & c \\ -1/2 & 1 & 0 \\ c & 0 & 1 \end{pmatrix}$.

The smallest eigenvalue of the matrix is $\lambda_1 = \lambda_1(c) = 1 - \sqrt{c^2 + 1/4}$, which is positive if and only if $c < \sqrt{3}/2$. Suppose $0 < c < \sqrt{3}/2$. Fix a row of $C_{NS}$, say, $a = (a_1, a_2, a_3)$. By direct calculations, the conditions for marginal regression and the lasso require

(20) $|a_1 + a_2 + a_3| < 1/2$, and $|(6 - 4c)a_1 + (6 - 2c - 4c^2)a_2 + (3 - 6c)a_3| \leq 3 - 4c^2$,

respectively. As $c$ approaches $\sqrt{3}/2$, both $\lambda_1(c)$ and the right hand side of the second inequality in (20) approach 0. As a result, the second inequality in (20) becomes increasingly more restrictive, but the first inequality remains the same for all $c$. Therefore, a small $\lambda_1(c)$ has a negative effect on the broadness of the condition for the lasso, but not on that of marginal regression.

Figure 2 displays the regions where the vector $a$ satisfies the first and the second inequality in (20), respectively. In this example, $c = 0.55, 0.75, 0.85$, so $\lambda_1(c) = 0.29, 0.14, 0.014$ correspondingly. To better visualize these regions, we display their 2-D sections in Figure 3 (in the 2-D section, we set the first coordinate of $a$ to 0). The figures suggest that when $\lambda_1(c)$ gets increasingly smaller, the region corresponding to the lasso shrinks substantially, while that corresponding to marginal regression remains the same.
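
The direct calculation behind the two inequalities can be verified mechanically. The sketch below assumes $C_{SS}$ has unit diagonal, entries $-1/2$ in positions (1,2) and (2,1), entries $c$ in positions (1,3) and (3,1), and zeros elsewhere; under that assumption, solving $C_{SS} w = 1_S$ gives the $a_2$ coefficient $6 - 2c - 4c^2$ used here. The helper names and the value $c = 0.5$ are illustrative.

```python
import math

def css(c):
    return [[1.0, -0.5, c], [-0.5, 1.0, 0.0], [c, 0.0, 1.0]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lasso_direction(c):
    # Candidate for w = C_SS^{-1} 1_S; the lasso condition is |<a, w>| <= 1.
    d = 3 - 4 * c * c
    return [(6 - 4 * c) / d, (6 - 2 * c - 4 * c * c) / d, (3 - 6 * c) / d]

def smallest_eigenvalue(c):
    # Closed form stated in the text: 1 - sqrt(c^2 + 1/4).
    return 1 - math.sqrt(c * c + 0.25)

c = 0.5  # illustrative value with 0 < c < sqrt(3)/2
w = lasso_direction(c)
check = matvec(css(c), w)                                       # C_SS w, should be all ones
mr_rhs = min(abs(x) for x in matvec(css(c), [1.0, 1.0, 1.0]))   # min |C_SS 1_S|
lam1 = smallest_eigenvalue(c)
```

Multiplying back by $C_{SS}$ confirms the candidate inverse direction, and `mr_rhs` reproduces the constant $1/2$ on the marginal regression side, which does not depend on $c$.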

[Figure 2: four panels titled "MR (for all c)", "Lasso (c = .55)", "Lasso (c = .75)", and "Lasso (c = .85)".]

Fig 2. The regions sandwiched by two hyper-planes are the regions of $a = (a_1, a_2, a_3)$ satisfying the condition for marginal regression (first inequality in (20); Panel 1) and for the lasso (second inequality of (20); Panels 2-4). Details of the example are in Section 2.3.4. Here, $c = 0.55, 0.75, 0.85$ and the smallest eigenvalues of $C_{SS}$ are $\lambda_1(c) = 0.29, 0.14, 0.014$. As $c$ varies, the region for marginal regression remains the same, so it is displayed only in Panel 1 (MR stands for marginal regression). In contrast, the regions for the lasso get substantially smaller as $\lambda_1(c)$ decreases.

[Figure 3: 2-D sections of the regions in Figure 2.]

Fig 3. Displayed are the 2-D sections of the regions in Figure 2, where we set the first coordinate of $a$ to 0. As $c$ varies, the regions for marginal regression remain the same, but those for the lasso get substantially smaller as $\lambda_1(c)$ decreases. x-axis: $a_2$. y-axis: $a_3$.

In conclusion, in the noise-free case, the conditions for the lasso and for marginal regression are generally overlapping and closely related. On one hand, given a fixed matrix $X$, the conditions for the success of the lasso hold over a larger class of $\beta$'s than that of marginal regression. On the other hand, marginal regression has a larger tolerance for collinearity than does the lasso. Bear in mind that for very large $p$, the lasso is computationally much more demanding.

2.4. Exact reconstruction conditions for marginal regression in the noisy case. We now come back to Model (5) and consider the noisy case:

(21) $Y^{(n)} = X^{(n)}\beta^{(n)} + z^{(n)}$, where $z^{(n)} \sim N(0, \sigma_n^2 \cdot I_n)$.

To focus on variable selection, we suppose the parameter $\sigma_n^2$ is known. The exact reconstruction condition for the lasso in the noisy case has been studied extensively in the literature (see for example [26]). So in this paper, we focus on that for marginal regression. We address two topics. First, we extend Condition F in the noise-free case to the noisy case, say Condition F'. When Condition F' holds, we show that with an appropriately chosen threshold $t$ (see (4)), marginal regression fully recovers the support with high probability. Second, we discuss how to determine the threshold $t$ empirically. Recall that in the noise-free case, Condition F is

$\max|C_{NS}\beta_S| < \min|C_{SS}\beta_S|$.

A natural extension of Condition F is the following.

Condition F'. (Faithfulness)

(22) $\max|C_{NS}\beta_S| + 2\sigma_n\sqrt{2\log p} < \min|C_{SS}\beta_S|$.

When Condition F' holds, it is possible to separate relevant variables from irrelevant variables with high probability. In detail, write

$X = [x_1, x_2, \ldots, x_p]$,

where $x_i$ denotes the $i$-th column of $X$. Sort $|(Y, x_i)|$ in descending order, and let $r_i = r_i(X, Y)$ be the ranks of $|(Y, x_i)|$ (assume no ties for simplicity). Introduce

$\hat S_n(k) = \hat S_n(k; X, Y, p) = \{i : r_i(X, Y) \leq k\}$, $k = 1, 2, \ldots, p$.

Recall that $S(\beta)$ denotes the support of $\beta$ and $s = |S|$. The following lemma says that, if $s$ is known and Condition F' holds, then marginal regression is able to fully recover the support $S$ with high probability.

Lemma 2. Consider a sequence of regression models as in (21). If for sufficiently large $n$, Condition F' holds and $p^{(n)} \geq n$, then

$\lim_{n \to \infty} P\left(\hat S_n(s^{(n)}; X^{(n)}, Y^{(n)}, p^{(n)}) \neq S(\beta^{(n)})\right) = 0$.

Lemma 2 is proved in the appendix. We remark that if both $s$ and $(p - s)$ tend to $\infty$ as $n$ tends to $\infty$, then Lemma 2 continues to hold if we replace $2\sigma_n\sqrt{2\log p}$ in (22) by $\sigma_n(\sqrt{\log(p - s)} + \sqrt{\log s})$. See the proof of the lemma for details.
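
Condition F' and the top-$k$ set $\hat S_n(k)$ are both simple to evaluate when the relevant quantities are available. The following sketch assumes the Gram blocks $C_{NS}$ and $C_{SS}$ and the signal vector $\beta_S$ are known, which is only realistic in simulations; the function names and numbers are hypothetical.

```python
import math

def top_k(X, Y, k):
    """S_hat_n(k): indices of the k columns with the largest |<Y, x_i>|."""
    n, p = len(X), len(X[0])
    alpha = [abs(sum(X[i][j] * Y[i] for i in range(n))) for j in range(p)]
    order = sorted(range(p), key=lambda j: -alpha[j])
    return sorted(order[:k])

def condition_f_prime(C_NS, C_SS, beta_S, sigma, p):
    """Check (22): max|C_NS beta_S| + 2 sigma sqrt(2 log p) < min|C_SS beta_S|."""
    def absmv(M):
        return [abs(sum(m * b for m, b in zip(row, beta_S))) for row in M]
    margin = 2 * sigma * math.sqrt(2 * math.log(p))
    return max(absmv(C_NS)) + margin < min(absmv(C_SS))

# Hypothetical numbers: s = 2 signal columns, one noise column, p = 3.
C_SS = [[1.0, 0.0], [0.0, 1.0]]
C_NS = [[0.1, 0.1]]
strong = condition_f_prime(C_NS, C_SS, [10.0, 10.0], sigma=1.0, p=3)
weak = condition_f_prime(C_NS, C_SS, [3.0, 3.0], sigma=1.0, p=3)
picked = top_k([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [3.0, -0.2, 1.0], 2)
```

The noise margin $2\sigma_n\sqrt{2\log p}$ is what separates the two cases: the same design passes the check with strong signals and fails it with weaker ones.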

The key assumption of Lemma 2 is that $s$ is known so that we know how to set the threshold $t$. Unfortunately, $s$ is generally unknown. We propose the following procedure to estimate $s$. Fix $1 \leq k \leq p$, let $i_k$ be the unique index satisfying

$r_{i_k}(X, Y) = k$.

Let $\hat V_n(k) = \hat V_n(k; X, Y, p)$ be the linear space spanned by

$x_{i_1}, x_{i_2}, \ldots, x_{i_k}$,

and let $\hat H_n(k) = \hat H_n(k; X, Y, p)$ be the projection matrix from $\mathbb{R}^n$ to $\hat V_n(k)$ (here and below, the hat emphasizes the dependence of the indices $i_k$ on the data). Define

$\hat\delta_n(k) = \hat\delta_n(k; X, Y, p) = \|(\hat H_n(k + 1) - \hat H_n(k))Y\|$, $1 \leq k \leq p - 1$.

The term $\hat\delta_n(k)$ is closely related to the F-test for testing whether $\beta_{i_{k+1}} \neq 0$. We estimate $s$ by

$\hat s_n = \hat s_n(X, Y, p) = \max\{1 \leq k \leq p : \hat\delta_n(k) \geq \sigma_n\sqrt{2\log n}\} + 1$

(in the case where $\hat\delta_n(k) < \sigma_n\sqrt{2\log n}$ for all $k$, we define $\hat s_n = 1$). Once $\hat s_n$ is determined, we estimate the support $S$ by

$\hat S(\hat s_n, X, Y, p) = \{i_k : k = 1, 2, \ldots, \hat s_n\}$.
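
The procedure above can be sketched with a Gram-Schmidt pass over the columns in rank order, since the difference of two successive projections is the projection onto the newly added orthogonal direction. This is a minimal illustration in plain Python, assuming $\sigma_n$ is known; the function name `estimate_support`, the numerical tolerance, and the toy data are assumptions of the sketch, and $\hat\delta_n(k)$ is evaluated only for $k \leq p - 1$, where it is defined.

```python
import math

def estimate_support(X, Y, sigma):
    """X: list of n rows of length p. Returns (s_hat, estimated support)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Indices i_1, ..., i_p ordered by decreasing |<Y, x_i>|.
    order = sorted(range(p), key=lambda j: -abs(dot(cols[j], Y)))
    thresh = sigma * math.sqrt(2 * math.log(n))
    basis, deltas = [], []
    for j in order:
        # Residual of x_j after projecting out the span of the earlier columns.
        r = cols[j][:]
        for q in basis:
            coef = dot(q, r)
            r = [ri - coef * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:
            q = [ri / norm for ri in r]
            basis.append(q)
            deltas.append(abs(dot(q, Y)))  # norm of the increment in the projection of Y
        else:
            deltas.append(0.0)             # column adds nothing to the span
    # deltas[k] (0-based position k) equals delta_hat(k) for k = 1, ..., p - 1.
    ks = [k for k in range(1, p) if deltas[k] >= thresh]
    s_hat = max(ks) + 1 if ks else 1
    return s_hat, sorted(order[:s_hat])

# Toy data (hypothetical): orthonormal design, two strong and two weak coefficients.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
Y = [5.0, -4.0, 0.1, 0.05]
s_hat, support = estimate_support(X, Y, sigma=1.0)
```

In this toy case the threshold $\sigma_n\sqrt{2\log n}$ is about 1.67, so only the step that adds the second-ranked column clears it and the estimate is $\hat s_n = 2$.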

It turns out that under mild conditions, $\hat s_n = s$ with high probability. In detail, suppose that the support $S(\beta)$ consists of indices $j_1, j_2, \ldots, j_s$. Fix $1 \leq k \leq s$. Let $V_S$ be the linear space spanned by $x_{j_1}, \ldots, x_{j_s}$, and let $V_{S,(-k)}$ be the linear space spanned by $x_{j_1}, \ldots, x_{j_{k-1}}, x_{j_{k+1}}, \ldots, x_{j_s}$. Project $\beta_{j_k} x_{j_k}$ to the linear space $V_S \cap V_{S,(-k)}^\perp$. Let $\Delta_n(k, \beta, X, p)$ be the $\ell_2$ norm of the resulting vector, and let

$\Delta_n^*(\beta, X, p) = \min_{1 \leq k \leq s} \Delta_n(k, \beta, X, p)$.

The following theorem says that if $\Delta_n^*(\beta, X, p)$ is slightly larger than $\sigma_n\sqrt{2\log n}$, then $\hat s_n = s$ and $\hat S_n = S$ with high probability. In other words, marginal regression fully recovers the support with high probability. Theorem 3 is proved in the appendix.



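Theorem 3. Consider a sequence of regression models as in (21). Suppose that for sufficiently large $n$, Condition F' holds, $p^{(n)} \geq n$, and $\Delta_n^*(\beta^{(n)}, X^{(n)}, p^{(n)})$ is slightly larger than $\sigma_n\sqrt{2\log n}$.

Then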



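$\lim_{n \to \infty} P\left(\hat s_n(X^{(n)}, Y^{(n)}, p^{(n)}) \neq s^{(n)}\right) = 0$

and

$\lim_{n \to \infty} P\left(\hat S_n(\hat s_n(X^{(n)}, Y^{(n)}, p^{(n)}); X^{(n)}, Y^{(n)}, n, p^{(n)}) \neq S(\beta^{(n)})\right) = 0$.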



Theorem 3 says that the tuning parameter for marginal regression (i.e. the
threshold $t$) can be set successfully in a data driven fashion. In comparison,
how to set the tuning parameter $\lambda$ for the lasso has been a longstanding
open problem in the literature.

3. The Deterministic Design, Random Coefficient Regime. Recall that the Faithfulness Condition is

$\max|C_{NS}\beta_S| < \min|C_{SS}\beta_S|$.

In this section, we study how broadly the Faithfulness Condition holds. We approach this by modeling $\beta$ as random (the matrix $X$ is kept deterministic), and find conditions under which the Faithfulness Condition holds with high probability.

The discussion in this section is closely related to the work by Donoho 
and Elad [8] on the Incoherence Condition. Compared to the Faithfulness 
Condition, the advantage of the Incoherence Condition is that it does not 
involve the unknown support of $\beta$, so it is checkable in practice. The downside
of the Incoherence Condition is that it aims to control the worst case
so it is conservative. In this section, we derive a condition — Condition F"— 
which can be viewed as a middle ground between the Faithfulness Condition 
and the Incoherence Condition: it is not tied to the unknown support so it 
is more tractable than the Faithfulness Condition, and it is also much less 
stringent than the Incoherence Condition. 

In detail, we model $\beta$ as follows. Fix $\epsilon \in (0, 1)$, $a > 0$, and a distribution $\pi$, where

(23) the support of $\pi \subset (-\infty, -a] \cup [a, \infty)$.

For each $1 \leq i \leq p$, we draw a sample $B_i$ from Bernoulli($\epsilon$). When $B_i = 0$, we set $\beta_i = 0$. When $B_i = 1$, we draw $\beta_i \sim \pi$. Marginally,

(24) $\beta_i \overset{iid}{\sim} (1 - \epsilon)\nu_0 + \epsilon\pi$,

where $\nu_0$ denotes the point mass at 0. We study for which quadruplets $(X, \epsilon, \pi, a)$ the Faithfulness Condition holds with high probability.
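
The sampling scheme in (23)-(24) and the resulting check of the Faithfulness Condition can be simulated directly. In the sketch below the distribution $\pi$ is taken, purely for illustration, to have magnitude uniform on $[a, 2a]$ with a random sign; this choice, the function names, and the identity Gram matrix in the example are all assumptions and not from the paper.

```python
import random

def sample_beta(p, eps, a, rng):
    """Draw beta_i i.i.d. from (1 - eps) * (point mass at 0) + eps * pi."""
    beta = []
    for _ in range(p):
        if rng.random() < eps:          # B_i = 1 with probability eps
            # Illustrative pi: magnitude uniform on [a, 2a], random sign.
            beta.append(rng.choice([-1.0, 1.0]) * rng.uniform(a, 2 * a))
        else:                           # B_i = 0
            beta.append(0.0)
    return beta

def faithful(C, beta):
    """Condition F: max|C_NS beta_S| < min|C_SS beta_S| for a p x p Gram matrix C."""
    S = [i for i, b in enumerate(beta) if b != 0.0]
    N = [i for i, b in enumerate(beta) if b == 0.0]
    if not S or not N:
        return True  # convention for degenerate cases, an assumption of this sketch
    def alpha(i):
        return abs(sum(C[i][j] * beta[j] for j in S))
    return max(alpha(i) for i in N) < min(alpha(i) for i in S)

rng = random.Random(0)
p = 50
beta = sample_beta(p, eps=0.2, a=1.0, rng=rng)
identity = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
ok = faithful(identity, beta)
```

Repeating the draw many times for a fixed correlated Gram matrix gives a Monte Carlo estimate of how often faithfulness holds, which is the quantity the bounds in this section control.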

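Recall that the design matrix $X = [x_1, \ldots, x_p]$, where $x_i$ denotes the $i$-th column. Fix $t > 0$ and $\delta > 0$. Introduce

$g_{ij}(t) = E_\pi[e^{t u (x_i, x_j)}] - 1$, $\bar g_i(t) = \sum_{j \neq i} g_{ij}(t)$,

where the random variable $u \sim \pi$. As before, we have suppressed the superscript $(n)$ for $g_{ij}(t)$ and $\bar g_i(t)$. Define

$A_n(\delta, \epsilon, \bar g) = A_n(\delta, \epsilon, \bar g; X, \pi) = \min_{t > 0}\left(e^{-\delta t}\sum_{i=1}^p \left[e^{\epsilon\bar g_i(t)} + e^{\epsilon\bar g_i(-t)}\right]\right)$,

where $\bar g$ denotes the vector $(\bar g_1, \ldots, \bar g_p)^T$. The following lemma is proved in the appendix.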

Lemma 3. Fix $n$, $X$, $\delta > 0$, $\epsilon \in (0, 1)$, and distribution $\pi$. Then

(25) $P(\max|C_{NS}\beta_S| \geq \delta) \leq (1 - \epsilon)A_n(\delta, \epsilon, \bar g; X, \pi)$,

and

(26) $P(\max|(C_{SS} - I)\beta_S| \geq \delta) \leq \epsilon A_n(\delta, \epsilon, \bar g; X, \pi)$.

Now, suppose the distribution $\pi$ satisfies (23) for some $a > 0$. Take $\delta = a/2$ on the right hand side of (25)-(26). Except for a probability of $A_n(a/2, \epsilon, \bar g)$,

$\max|C_{NS}\beta_S| < a/2$, $\min|C_{SS}\beta_S| \geq \min|\beta_S| - \max|(C_{SS} - I)\beta_S| > a/2$,

so $\max|C_{NS}\beta_S| < \min|C_{SS}\beta_S|$ and the Faithfulness Condition holds. This motivates the following condition, where $(a, \epsilon, \pi)$ may depend on $n$.

Condition F". (Faithfulness)

(27) $\lim_{n \to \infty} A_n(a_n/2, \epsilon_n, \bar g^{(n)}; X^{(n)}, \pi_n) = 0$.

The following theorem says that if Condition F" holds, then Condition F holds with high probability.

Theorem 4. Consider a sequence of noise-free regression models as in (21), where the noise component $z^{(n)} = 0$ and $\beta^{(n)}$ is generated as in (24). Suppose Condition F" holds. Then as $n$ tends to $\infty$, except for a probability that tends to 0,

$\max|C_{NS}\beta_S| < \min|C_{SS}\beta_S|$.

Theorem 4 is the direct result of Lemma 3 so we omit the proof.

3.1. Comparison of Condition F" with the Incoherence Condition. Introduced in Donoho and Elad [8] (see also [9]), the Incoherence of a matrix $X$ is defined as [8]

$\max_{i \ne j} |C_{ij}|$,

where $C = X^T X$ is the Gram matrix as before. The notion is motivated by the study of recovering a sparse signal from an over-complete dictionary. In the special case where $X$ is the concatenation of two orthonormal bases (e.g. a Fourier basis and a wavelet basis), $\max_{i \ne j} |C_{ij}|$ measures how coherent the two bases are, hence the term incoherence; see [8, 9] for details. Consider Model (1) in the case where both $X$ and $\beta$ are deterministic, and the noise component $z = 0$. The following results are proved in [5, 8, 9].

• Lasso yields exact variable selection if $s < \frac{1 + \max_{i \ne j} |C_{ij}|}{2 \max_{i \ne j} |C_{ij}|}$.

• Marginal regression yields exact variable selection if $s < \frac{c}{2 \max_{i \ne j} |C_{ij}|}$ for some constant $c \in (0, 1)$, and the nonzero coordinates of $\beta$ have comparable magnitudes (i.e. the ratio between the largest and the smallest nonzero coordinate of $\beta$ is bounded away from $\infty$).
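
Since the Incoherence Condition depends on $X$ only through $\max_{i \ne j} |C_{ij}|$, it is straightforward to evaluate. The following sketch computes the Gram matrix, the coherence, and the sparsity level $(1 + \max_{i \ne j}|C_{ij}|)/(2\max_{i \ne j}|C_{ij}|)$ appearing in the lasso bound above. The function names and the tiny example design are hypothetical; the columns of the example are normalized to unit length by construction.

```python
def gram(X):
    """Gram matrix C = X^T X for a design X given as a list of rows."""
    n, p = len(X), len(X[0])
    return [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
            for i in range(p)]

def coherence(C):
    """Largest absolute off-diagonal entry of C."""
    p = len(C)
    return max(abs(C[i][j]) for i in range(p) for j in range(p) if i != j)

def lasso_sparsity_bound(C):
    """Value of (1 + mu) / (2 * mu) with mu the coherence of C."""
    mu = coherence(C)
    return float("inf") if mu == 0 else (1 + mu) / (2 * mu)

# Two orthonormal columns plus a third unit-norm column correlated with both.
X = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
C = gram(X)
mu = coherence(C)
bound = lasso_sparsity_bound(C)
```

In this example the coherence is driven entirely by one pair of columns, which is the worst-case behaviour the surrounding discussion refers to.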

In comparison, the Incoherence Condition depends only on $X$, so it is checkable. Condition F depends on the unknown support of $\beta$, and checking such a condition is almost as hard as estimating the support $S$. Condition F" provides a middle ground. It depends on $\beta$ only through $(\epsilon, \pi)$. In cases where we either have good knowledge of $(\epsilon, \pi)$ or can estimate them, Condition F" is checkable.

At the same time, the Incoherence Condition is conservative, especially when $s$ is large. In fact, in order for either the lasso or marginal regression to achieve exact variable selection, it is required that

(28) $\max_{i \ne j} |C_{ij}| \le O(1/s)$.

In other words, all off-diagonal coordinates of the Gram matrix $C$ need to be no greater than $O(1/s)$. This is much more conservative than Condition F.

However, we must note that the Incoherence Condition aims to control the 
worst case: it sets out to guarantee uniform success of a procedure across all 
j3 under minimum constraints. In comparison, Condition F aims to control 
a single case, and Condition F" aims to control almost all the cases in a 
specified class. As such, Condition F" provides a middle ground between 
Condition F and the Incoherence Condition, applying more broadly than 
the former, while being less conservative than the latter.

Below, we use two examples to illustrate that Condition F" is much less conservative than the Incoherence Condition. In the first example, we consider a weakly dependent case where $\max_{i \ne j} |C_{ij}| \le O(1/\log(p))$. In the second example, we suppose the matrix $C$ is sparse, but the nonzero coordinates of $C$ may be large.

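3.1.1. The weakly dependent case. Suppose that for sufficiently large $n$, there are two sequences of positive numbers $a_n \le b_n$ such that

the support of $\pi_n \subset [-b_n, -a_n] \cup [a_n, b_n]$,

and that

$\frac{b_n}{a_n} \cdot \max_{i \ne j} |C_{ij}| \le c_1/\log(p)$, where $c_1 > 0$ is a constant.

For $k \ge 1$, denote the $k$-th moment of $\pi_n$ by

(29) $\mu_n^{(k)} = \mu_n^{(k)}(\pi_n)$.

Introduce $m_n = m_n(X)$ and $v_n^2 = v_n^2(X)$ by

$m_n(X) = p\epsilon_n \cdot \max_{1 \le i \le p} \Big\{ \Big| \frac{1}{p} \sum_{j \ne i} C_{ij} \Big| \Big\}, \qquad v_n^2(X) = p\epsilon_n \cdot \max_{1 \le i \le p} \Big\{ \frac{1}{p} \sum_{j \ne i} C_{ij}^2 \Big\}.$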



Corollary 3.1. Consider a sequence of regression models as in (21), where the noise component $z^{(n)} = 0$ and $\beta^{(n)}$ is generated as in (24). If there are constants $c_1 > 0$ and $c_2 \in (0, 1/2)$ such that

$\frac{b_n}{a_n} \max_{i \ne j} \{|C_{ij}|\} \le c_1/\log(p^{(n)})$,

and

(30) $\limsup_{n \to \infty} \Big( \frac{\mu_n^{(1)}}{a_n} m_n(X^{(n)}) \Big) \le c_2, \qquad \lim_{n \to \infty} \Big( \frac{\mu_n^{(2)}}{a_n^2} v_n^2(X^{(n)}) \log(p^{(n)}) \Big) = 0,$

then

$\lim_{n \to \infty} A_n(a_n/2, \epsilon_n, \bar{g}^{(n)}; X^{(n)}, \pi_n) = 0$,

and Condition F" holds.



Corollary 3.1 is proved in the appendix. For interpretation, we consider the special case where there is a generic constant $c > 0$ such that $b_n \le c a_n$. As a result, $\mu_n^{(1)}/a_n \le c$ and $\mu_n^{(2)}/a_n^2 \le c^2$. The conditions reduce to that, for sufficiently large $n$ and all $1 \le i \le p$,

$\Big| \frac{1}{p} \sum_{j \ne i} C_{ij} \Big| \le O\Big(\frac{1}{p\epsilon_n}\Big), \qquad \frac{1}{p} \sum_{j \ne i} C_{ij}^2 = o(1/(p\epsilon_n)).$

Note that by (24), $s = s^{(n)} \sim$ Binomial$(p, \epsilon_n)$, so $s \approx p\epsilon_n$. Recall that the Incoherence Condition is

$\max_{i \ne j} |C_{ij}| \le O(1/s)$.

In comparison, the Incoherence Condition requires that each coordinate of $(C - I)$ is no greater than $O(1/s)$, while Condition F" only requires that the average of each row of $(C - I)$ is no greater than $O(1/s)$. The latter is much less conservative.
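
The contrast between a worst-entry condition and a row-average condition can be seen numerically. The sketch below computes $m_n(X)$ and $v_n^2(X)$, read here as $p\epsilon_n$ times the largest absolute row average and the largest row mean square of the off-diagonal entries of $C$, and compares them with the largest off-diagonal entry. The function name `row_summaries` and the toy Gram matrix are assumptions made for illustration.

```python
def row_summaries(C, eps):
    """Return (m_n, v_n^2) for a Gram matrix C and sparsity level eps.

    m_n   = p*eps * max_i | (1/p) * sum_{j != i} C_ij |
    v_n^2 = p*eps * max_i   (1/p) * sum_{j != i} C_ij^2
    """
    p = len(C)
    mean_abs = max(abs(sum(C[i][j] for j in range(p) if j != i)) / p
                   for i in range(p))
    mean_sq = max(sum(C[i][j] ** 2 for j in range(p) if j != i) / p
                  for i in range(p))
    return p * eps * mean_abs, p * eps * mean_sq

# One strongly correlated pair of columns, everything else orthogonal.
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
m_n, v2_n = row_summaries(C, eps=0.1)
worst_entry = max(abs(C[i][j]) for i in range(3) for j in range(3) if i != j)
```

Here the single large entry 0.5 determines the coherence, while the row summaries are damped by the factor $\epsilon_n$.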

3.1.2. The sparse case. Let $N^*(C)$ be the maximum number of nonzero off-diagonal coordinates in a row of $C$:

$N^*(C) = \max_{1 \le i \le p} \{N_n(i)\}, \qquad N_n(i) = N_n(i; C) = \#\{j : j \ne i, \ C_{ij} \ne 0\}.$

Suppose there is a constant $c_3 > 0$ such that

(31) $\epsilon_n N^*(C) \le p^{-[c_3 + o(1)]}$.

Also, suppose there is a constant $c_4 > 0$ such that for sufficiently large $n$,

(32) the support of $\pi_n$ is contained in $[-c_4 a_n, -a_n] \cup [a_n, c_4 a_n]$.
The following corollary is proved in the appendix.

Corollary 3.2. Consider a sequence of noise-free regression models as in (21), where the noise component $z^{(n)} = 0$ and $\beta^{(n)}$ is randomly generated as in (24). Suppose (31)-(32) hold. If there is a constant $\delta > 0$ such that

(33) $\max_{i \ne j} |C_{ij}| \le \delta$, and $\delta < \frac{c_3}{2c_4}$,

then

$\lim_{n \to \infty} A_n(a_n/2, \epsilon_n, \bar{g}^{(n)}; X^{(n)}, \pi_n) = 0$,

and Condition F" holds.

For interpretation, consider a special case where

$\epsilon_n = p^{-\vartheta}$.

In this case, the condition reduces to

$N^*(C) \le p^{\vartheta - 2c_4\delta}$.

As a result, Condition F" is satisfied if each row of $(C - I)$ contains no more than $p^{\vartheta - 2c_4\delta}$ nonzero coordinates, each of which is $\le \delta$. Compared to the Incoherence Condition $\max_{i \ne j} |C_{ij}| \le O(1/s) = O(p^{-(1-\vartheta)})$, our condition is much weaker.

In conclusion, if we shift our attention from the worst-case scenario to the average scenario, and shift our aim from exact variable selection to exact variable selection with probability $\approx 1$, then the condition required for success, Condition F", is much more relaxed than the Incoherence Condition.

4. Hamming Distance when X is Gaussian; Partition of the Phase Diagram. So far, we have focused on exact variable selection. In many applications, exact variable selection is not possible. Therefore, it is of interest to study the Type I and Type II errors of variable selection (a Type I error is a misclassified zero coordinate of $\beta$, and a Type II error is a misclassified nonzero coordinate).

In this section, we use the Hamming distance to measure the variable selection errors. Back to Model (1),

(34) $Y = X\beta + z, \qquad z \sim N(0, I_n)$,

where without loss of generality, we assume $\sigma_n = 1$. As in the preceding section (i.e. (24)), we suppose

(35) $\beta_i \stackrel{iid}{\sim} (1 - \epsilon)\nu_0 + \epsilon\pi$.

For any variable selection procedure $\hat\beta = \hat\beta(Y; X)$, the Hamming distance between $\hat\beta$ and the true $\beta$ is

$d(\hat\beta | X) = d(\hat\beta; \epsilon, \pi | X) = \sum_{j=1}^{p} E_{\epsilon, \pi}\big( E_z[ 1(\mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j)) \,|\, X ] \big).$

Note that by Chebyshev's inequality,

$P(\text{non-exact variable selection by } \hat\beta(Y; X)) \le d(\hat\beta | X)$.

So a small Hamming distance guarantees exact variable selection with high probability.
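
For a single realization, the quantity inside the expectations is simply the number of coordinates whose estimated sign differs from the true sign. A minimal sketch follows; the function names are hypothetical, and the expectation over $\beta$ and $z$ would in practice be approximated by averaging this count over simulated replications.

```python
def sgn(x):
    """Sign of x as -1, 0 or 1."""
    return (x > 0) - (x < 0)

def hamming_sign_distance(beta_hat, beta):
    """Number of coordinates j with sgn(beta_hat_j) != sgn(beta_j)."""
    return sum(1 for bh, b in zip(beta_hat, beta) if sgn(bh) != sgn(b))

beta_true = [0.0, 2.0, 2.0, 0.0]
beta_est = [0.0, 1.5, -2.0, 0.3]   # one sign flip, one false positive
errors = hamming_sign_distance(beta_est, beta_true)
```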

How to characterize the Hamming distance precisely is a challenging problem. We approach this by modeling $X$ as random. Assume that the coordinates of $X$ are iid samples from $N(0, 1/n)$:

(36) $X_{ij} \stackrel{iid}{\sim} N(0, 1/n)$.

The choice of the variance ensures that most diagonal coordinates of the Gram matrix $C = X^T X$ are approximately 1. Let $P_X(x)$ denote the joint density of the coordinates of $X$. The expected Hamming distance is then

$d^*(\hat\beta) = d^*(\hat\beta; \epsilon, \pi) = \int d(\hat\beta; \epsilon, \pi | X = x) P_X(x) dx.$

We adopt an asymptotic framework where we calibrate $p$ and $\epsilon$ with

(37) $p = n^{1/\theta}, \qquad \epsilon_n = n^{-\vartheta/\theta} = p^{-\vartheta}, \qquad 0 < \theta, \vartheta < 1.$

This models a situation where $p \gg n$ and the vector $\beta$ gets increasingly sparse as $n$ grows. Also, we assume $\pi_n$ in (24) is a point mass

(38) $\pi_n = \nu_{\tau_n}$.

Despite its seemingly idealistic, the model was found to be subtle and rich 
in theory (e.g. [21 [TQl [TTJ [121 EE E2] ) • In addition, compare two experiments, 
in one of them $\pi_n = \nu_{\tau_n}$, and in the other the support of $\pi_n$ is contained in $[\tau_n, \infty)$. Since the second model is easier for inference than the first one, the optimal Hamming distance for the first one gives an upper bound for that of the second one.

With $\epsilon_n$ calibrated as above, the most interesting range for $\tau_n$ is $O(\sqrt{2 \log p})$ [10]: when $\tau_n \gg \sqrt{2 \log p}$, exact variable selection can be easily achieved by either the lasso or marginal regression. When $\tau_n \ll \sqrt{2 \log p}$, no variable selection procedure can achieve exact variable selection. In light of this, we calibrate

(39) $\tau_n = \sqrt{2(r/\theta) \log n} = \sqrt{2r \log p}, \qquad r > 0.$

With these calibrations, we can rewrite

$d_n^*(\hat\beta; \epsilon, \pi) = d_n^*(\hat\beta; \epsilon_n, \tau_n).$

Definition 4.1. Denote by $L(n)$ a multi-log term which satisfies $\lim_{n \to \infty} (L(n) \cdot n^{\delta}) = \infty$ and $\lim_{n \to \infty} (L(n) \cdot n^{-\delta}) = 0$ for any $\delta > 0$.

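We are now ready to spell out the main results. Define

$\rho(\vartheta) = (1 + \sqrt{1 - \vartheta})^2, \qquad 0 < \vartheta < 1.$

The following theorem, which gives the lower bound for the Hamming distance, is proved in the appendix.

Theorem 5. Fix $\vartheta \in (0, 1)$, $\theta > 0$, and $r > 0$ such that $\theta > 2(1 - \vartheta)$. Consider a sequence of regression models as in (34)-(39). As $n \to \infty$, for any variable selection procedure $\hat\beta^{(n)}$,

$d_n^*(\hat\beta^{(n)}; \epsilon_n, \tau_n) \ge \begin{cases} L(n) \, p^{1 - \frac{(\vartheta + r)^2}{4r}}, & r \ge \vartheta, \\ (1 + o(1)) \cdot p^{1 - \vartheta}, & 0 < r < \vartheta. \end{cases}$

At the same time, let $\hat\beta_{mr}$ be the estimate of $\beta$ using marginal regression with threshold

$t_n = \Big( \frac{\vartheta + r}{2\sqrt{r}} \wedge \sqrt{r} \Big) \cdot \sqrt{2 \log p}.$

We have the following theorem.

Theorem 6. Fix $\vartheta \in (0, 1)$, $r > 0$, and $\theta > (1 - \vartheta)$. Consider a sequence of regression models as in (34)-(39). As $p \to \infty$, the Hamming distance of marginal regression with the threshold $t_n = \big( \frac{\vartheta + r}{2\sqrt{r}} \wedge \sqrt{r} \big) \cdot \sqrt{2 \log p^{(n)}}$ satisfies

$d_n^*(\hat\beta_{mr}; \epsilon_n, \tau_n) \le \begin{cases} L(n) \, p^{1 - \frac{(\vartheta + r)^2}{4r}}, & r \ge \vartheta, \\ (1 + o(1)) \cdot p^{1 - \vartheta}, & 0 < r < \vartheta. \end{cases}$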

Similarly, choosing the tuning parameter $\lambda_n = 2\big( \frac{\vartheta + r}{2\sqrt{r}} \wedge \sqrt{r} \big) \sqrt{2 \log p}$ in the lasso, we have the following theorem.

Fig 4. Three regions as in Section 4. In the Region of Exact Recovery, both the lasso and marginal regression yield exact recovery with high probability. In the Region of Almost Full Recovery, it is impossible to have large probability for exact variable selection, but the Hamming distance of both the lasso and marginal regression is $\ll p\epsilon_n$. In the Region of No Recovery, the optimal Hamming distance is $\sim p\epsilon_n$ and all variable selection procedures fail completely. Displayed is the part of the plane corresponding to $0 < r < 4$ only.



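Theorem 7. Fix $\vartheta \in (0, 1)$, $r > 0$, and $\theta > (1 - \vartheta)$. Consider a sequence of regression models as in (34)-(39). As $p \to \infty$, the Hamming distance of the lasso with the tuning parameter $\lambda_n = 2\big( \frac{\vartheta + r}{2\sqrt{r}} \wedge \sqrt{r} \big) \cdot \sqrt{2 \log p^{(n)}}$ satisfies

$d_n^*(\hat\beta_{lasso}; \epsilon_n, \tau_n) \le \begin{cases} L(n) \, p^{1 - \frac{(\vartheta + r)^2}{4r}}, & r \ge \vartheta, \\ (1 + o(1)) \cdot p^{1 - \vartheta}, & 0 < r < \vartheta. \end{cases}$

The proofs of Theorems 6-7 are routine and we omit them.

Theorems 5-7 say that in the $\vartheta$-$r$ plane, we have three different regions, as displayed in Figure 4.

• Region I (Exact Recovery): $0 < \vartheta < 1$ and $r > \rho(\vartheta)$.

• Region II (Almost Full Recovery): $0 < \vartheta < 1$ and $\vartheta < r < \rho(\vartheta)$.

• Region III (No Recovery): $0 < \vartheta < 1$ and $0 < r < \vartheta$.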
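
The partition of the $\vartheta$-$r$ plane can be written as a small lookup, using the boundary curves $r = \vartheta$ and $r = \rho(\vartheta) = (1 + \sqrt{1 - \vartheta})^2$. This is only a sketch: the function names and returned labels are hypothetical, and points lying exactly on a boundary curve are reported separately because the regions are defined with strict inequalities.

```python
import math

def rho(theta_sparsity):
    """Boundary curve rho(vartheta) = (1 + sqrt(1 - vartheta))^2."""
    return (1.0 + math.sqrt(1.0 - theta_sparsity)) ** 2

def classify(vartheta, r):
    """Name the region of the phase diagram containing (vartheta, r)."""
    if not (0.0 < vartheta < 1.0) or r <= 0.0:
        raise ValueError("need 0 < vartheta < 1 and r > 0")
    if r > rho(vartheta):
        return "exact recovery"
    if vartheta < r < rho(vartheta):
        return "almost full recovery"
    if r < vartheta:
        return "no recovery"
    return "boundary"

labels = [classify(0.5, r) for r in (3.5, 1.0, 0.2)]
```

For $\vartheta = 0.5$ the upper boundary is $\rho(0.5) \approx 2.91$, so the three sample points fall in Regions I, II and III respectively.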

In the Region of Exact Recovery, the Hamming distances for both marginal regression and the lasso are algebraically small. Therefore, except for a probability that is algebraically small, both marginal regression and the lasso give exact recovery.

In the Region of Almost Full Recovery, the Hamming distances of both marginal regression and the lasso are much smaller than the number of relevant variables (which is $\sim p\epsilon_n$). Therefore, almost all relevant variables have been recovered. Note also that the number of misclassified irrelevant variables is comparably much smaller than $p\epsilon_n$. In this region, the optimal Hamming distance is algebraically large, so for any variable selection procedure, the probability of exact recovery is algebraically small.

In the Region of No Recovery, the Hamming distance is $\sim p\epsilon_n$. In this region, asymptotically, it is impossible to distinguish relevant variables from irrelevant variables, and any variable selection procedure fails completely.

The results improve on those by Wainwright [30]. It was shown in [30] that there are constants $c_2 > c_1 > 0$ such that in the region of $\{0 < \vartheta < 1, r > c_2\}$, the lasso yields exact variable selection with overwhelming probability, and that in the region of $\{0 < \vartheta < 1, r < c_1\}$, no procedure could yield exact variable selection. Our results not only provide the exact rate of the Hamming distance, but also tighten the constants $c_1$ and $c_2$ so that $c_1 = c_2 = (1 + \sqrt{1 - \vartheta})^2$.

5. Simulations and Examples. In this section we consider some numerical examples. Figures 5 and 6 show the prediction error and the Hamming error for the lasso and marginal regression, as a function of the number
of variables selected. In all cases, n = 40, p = 500, a = 10 and s = 100. 
All nonzero $\beta_j$'s are set equal to either 0.5 or 5. Each row of the matrix $X$ is generated independently from $N(0, \Sigma)$, where $\Sigma$ is a $p$ by $p$ matrix with 1 on the diagonal and $\rho$ elsewhere. We take $\rho = 0, 0.2, 0.5, 0.9$. Figures 7 and 8 are the same but are averaged over 100 replications.

We see that in virtually all cases, marginal regression is competitive with 
the lasso and in some cases is much better. 
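
A stripped-down version of such an experiment for marginal regression alone can be sketched as follows. It is not the code used for the figures: the sizes, the noise level, and the function names are hypothetical, columns are not renormalized, and the Hamming error here counts support mismatches only (signs are ignored). Rows with unit variances and common correlation $\rho$ are generated through a shared Gaussian factor.

```python
import math
import random

def simulate(n, p, s, rho, signal, sigma, rng):
    """Rows of X are N(0, Sigma) with Sigma = 1 on the diagonal, rho elsewhere."""
    X = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)  # shared factor inducing correlation rho
        X.append([math.sqrt(rho) * w + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                  for _ in range(p)])
    beta = [signal] * s + [0.0] * (p - s)
    Y = [sum(X[k][j] * beta[j] for j in range(s)) + sigma * rng.gauss(0.0, 1.0)
         for k in range(n)]
    return X, Y, beta

def marginal_ranking(X, Y):
    """Order variables by decreasing |x_j^T Y|."""
    n, p = len(X), len(X[0])
    score = [abs(sum(X[k][j] * Y[k] for k in range(n))) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])

def hamming_path(order, beta):
    """Support-recovery errors after selecting the top 1, 2, ..., p variables."""
    truth = {j for j, b in enumerate(beta) if b != 0.0}
    selected, errors = set(), []
    for j in order:
        selected.add(j)
        errors.append(len(selected ^ truth))  # false positives + false negatives
    return errors

rng = random.Random(1)
X, Y, beta = simulate(n=40, p=60, s=5, rho=0.5, signal=5.0, sigma=1.0, rng=rng)
errors = hamming_path(marginal_ranking(X, Y), beta)
```

Plotting `errors` against the number of selected variables gives the kind of curve summarized in the Hamming-error panels; averaging over many seeds would mimic the replicated versions.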

6. Proofs. 

6.1. Proof of Theorem [^| First, let ki denote the number of non-zero 
off-diagonal entries in row $i$ of $C$. Because $C$ is symmetric but not diagonal, at least two rows must have non-zero $k_i$. Assume without loss of generality that the rows and columns of $C$ are arranged so that the rows with non-zero $k_i$ form the initial minor. It follows that the initial minor is itself a positive definite symmetric matrix. And because any such matrix $A$ satisfies $|A_{ij}| \le \max_k A_{kk}$ for $j \ne i$, there exists a row $i$ of $C$ with $k_i > 0$ and $|C_{ij}| \le C_{ii}$ for any $j \ne i$.

Define $\beta$ as follows:

(40) $\beta_j = \begin{cases} \rho C_{ii}/C_{ij}, & \text{if } j \ne i \text{ and } C_{ij} \ne 0, \\ \rho, & \text{if } j \ne i \text{ and } C_{ij} = 0, \\ -k_i \rho, & \text{if } j = i. \end{cases}$

Because $|C_{ij}| \le C_{ii}$, this satisfies $|\beta_j| \ge \rho$, so $\beta \in \mathcal{M}_\rho^s$. Moreover,

(41) $(C\beta)_i = \sum_j C_{ij}\beta_j = -k_i C_{ii} \rho + \sum_{j \ne i, \, C_{ij} \ne 0} \rho C_{ii} = 0.$

This proves the theorem.

Fig 5. Prediction error (left column) and Hamming error (right column) for the lasso (red circle) and marginal regression (solid line). The x-axis displays the number of variables included. Rows 1-4: $\rho = 0, 0.2, 0.5, 0.9$. Nonzero $\beta_i$ equal 0.5. Results are based on one replication.

Fig 6. Same as in Figure 5 but all nonzero $\beta_i$ equal 5.

Fig 7. Same as in Figure 5 but displayed are the average errors across 100 replications.

Fig 8. Same as in Figure 6 but displayed are the average errors across 100 replications.
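
The construction used in the proof of Section 6.1 is easy to check numerically: for a row $i$ with $|C_{ij}| \le C_{ii}$, set $\beta_j = \rho C_{ii}/C_{ij}$ where $C_{ij} \ne 0$, $\beta_j = \rho$ where $C_{ij} = 0$, and $\beta_i = -k_i\rho$, so that the $i$-th coordinate of $C\beta$ vanishes although $\beta_i \ne 0$. The sketch below does this for a small hand-picked Gram matrix; the function name and the example matrix are hypothetical.

```python
def unfaithful_beta(C, i, rho):
    """Build beta with |beta_j| >= rho for all j and (C beta)_i = 0."""
    p = len(C)
    k_i = sum(1 for j in range(p) if j != i and C[i][j] != 0)
    beta = []
    for j in range(p):
        if j == i:
            beta.append(-k_i * rho)
        elif C[i][j] != 0:
            beta.append(rho * C[i][i] / C[i][j])
        else:
            beta.append(rho)
    return beta

C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, -0.25],
     [0.0, -0.25, 1.0]]
beta = unfaithful_beta(C, i=1, rho=1.0)
marginal_coefficient = sum(C[1][j] * beta[j] for j in range(3))
```

In this example variable 1 has a nonzero coefficient, yet its marginal regression coefficient is zero, so marginal regression would miss it.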

6.2. Proof of Lemma 2. By the definition of $\hat S_n(s)$, it is sufficient to show that except for a probability that tends to 0,

$\max |X_N^T Y| < \min |X_S^T Y|$.

Since $Y = X\beta + z = X_S\beta_S + z$, we have $X_N^T Y = X_N^T(X_S\beta_S + z) = C_{NS}\beta_S + X_N^T z$. Note that $x_i^T z \sim N(0, \sigma_n^2)$. By Boolean algebra and elementary statistics,

$P(\max |X_N^T z| > \sigma_n\sqrt{2 \log p}) \le \sum_{i \in N} P(|(x_i, z)| \ge \sigma_n\sqrt{2 \log p}) \le \frac{C}{\sqrt{\log p}} \cdot \frac{p - s}{p}.$

It follows that except for a probability of $o(1)$,

$\max |X_N^T Y| \le \max |C_{NS}\beta_S| + \max |X_N^T z| \le \max |C_{NS}\beta_S| + \sigma_n\sqrt{2 \log p}.$

Similarly, except for a probability of $o(1)$,

$\min |X_S^T Y| \ge \min |C_{SS}\beta_S| - \max |X_S^T z| \ge \min |C_{SS}\beta_S| - \sigma_n\sqrt{2 \log p}.$

Combining these gives the claim. □

6.3. Proof of Theorem^ Once the first claim is proved, the second claim 
follows from Lemma 2. So we only show the first claim. Write for short $\hat S_n(s) = \hat S_n(s^{(n)}; X^{(n)}, Y^{(n)}, p^{(n)})$, $s = s^{(n)}$ and $S = S(\beta^{(n)})$. All we need to show is

$\lim_{n \to \infty} P(\hat s_n \ne s) = 0.$

Introduce the event

$D_n = \{\hat S_n(s) = S\}.$

It follows from Lemma 2 that

$P(D_n^c) \to 0.$

Write

$P(\hat s_n \ne s) \le P(D_n) P(\hat s_n \ne s | D_n) + P(D_n^c).$

It is sufficient to show $\lim_{n \to \infty} P(\hat s_n \ne s | D_n) = 0$, or equivalently,

(42) $\lim_{n \to \infty} P(\hat s_n > s | D_n) = 0$ and $\lim_{n \to \infty} P(\hat s_n < s | D_n) = 0.$



Consider the first claim of (42). Write for short $t_n = \sigma_n\sqrt{2 \log n}$. Note that the event $\{\hat s_n > s | D_n\}$ is contained in the event $\cup_{k=s}^{p-1} \{\hat\delta_n(k) \ge t_n | D_n\}$. Recalling $P(D_n^c) = o(1)$,

(43) $P(\hat s_n > s | D_n) \le \sum_{k=s}^{p-1} P(\hat\delta_n(k) \ge t_n | D_n) \lesssim \sum_{k=s}^{p-1} P(\hat\delta_n(k) \ge t_n),$

where we say two positive sequences $a_n \lesssim b_n$ if $\limsup_{n \to \infty} (a_n/b_n) \le 1$.

Fix $s \le k \le p - 1$. By definitions, $\hat H(k+1) - \hat H(k)$ is the projection matrix from $R^n$ to $\hat V_n(k+1) \cap \hat V_n(k)^{\perp}$. So conditional on the event $\{\hat V_n(k+1) = \hat V_n(k)\}$, $\hat\delta_n(k) = 0$, and conditional on the event $\{\hat V_n(k) \subsetneq \hat V_n(k+1)\}$, $\hat\delta_n^2(k) \sim \sigma_n^2 \chi^2(1)$. Note that $P(\chi^2(1) \ge 2 \log n) = o(1/n)$. It follows that

$\sum_{k=s}^{p-1} P(\hat\delta_n(k) \ge t_n) = \sum_{k=s}^{p-1} P(\hat\delta_n(k) \ge t_n | \hat V_n(k) \subsetneq \hat V_n(k+1)) \, P(\hat V_n(k) \subsetneq \hat V_n(k+1))$

(44) $= o\Big(\frac{1}{n}\Big) \sum_{k=s}^{p-1} P(\hat V_n(k) \subsetneq \hat V_n(k+1)).$

Moreover,

$\sum_{k=s}^{p-1} P(\hat V_n(k) \subsetneq \hat V_n(k+1)) = \sum_{k=s}^{p-1} E[1(\dim(\hat V_n(k+1)) > \dim(\hat V_n(k)))] = E\Big[\sum_{k=s}^{p-1} 1(\dim(\hat V_n(k+1)) > \dim(\hat V_n(k)))\Big].$

Note that for any realization of the sequences $\hat V_n(1), \ldots, \hat V_n(p)$, $\sum_{k=s}^{p-1} 1(\dim(\hat V_n(k+1)) > \dim(\hat V_n(k))) \le n$. It follows that

(45) $\sum_{k=s}^{p-1} P(\hat V_n(k) \subsetneq \hat V_n(k+1)) \le n.$

Combining (43)-(45) gives the claim.



Consider the second claim of (42). By the definition of $\hat s_n$, the event $\{\hat s_n < s | D_n\}$ is contained in the event $\{\hat\delta_n(s-1) < t_n | D_n\}$. By definitions, $\hat\delta_n(s-1) = \|(\hat H(s) - \hat H(s-1))Y\|$, where $\|\cdot\| = \|\cdot\|_2$ denotes the $\ell^2$ norm. So all we need to show is

(46) $\lim_{n \to \infty} P(\|(\hat H(s) - \hat H(s-1))Y\| < t_n | D_n) = 0.$

Fix $1 \le k \le p$. Recall that $i_k$ denotes the index at which the rank of $|(Y, x_{i_k})|$ among all $|(Y, x_j)|$ is $k$. Denote by $\tilde X(k)$ the $n$ by $k$ matrix $[x_{i_1}, x_{i_2}, \ldots, x_{i_k}]$, and denote by $\tilde\beta(k)$ the $k$-vector $(\beta_{i_1}, \beta_{i_2}, \ldots, \beta_{i_k})^T$. Conditional on the event $D_n$, $\hat S_n(s) = S$, and $\beta_{i_1}, \beta_{i_2}, \ldots, \beta_{i_s}$ are all the nonzero coordinates of $\beta$. So according to our notations,

(47) $X\beta = \tilde X(s)\tilde\beta(s) = \tilde X(s-1)\tilde\beta(s-1) + \beta_{i_s} x_{i_s}.$

Now, first, note that $\hat H(s)\tilde X(s) = \tilde X(s)$ and $\hat H(s-1)\tilde X(s-1) = \tilde X(s-1)$. Combine this with (47). It follows from direct calculations that

(48) $(\hat H(s) - \hat H(s-1))X\beta = \beta_{i_s}(I - \hat H(s-1))x_{i_s}.$

Second, since $x_{i_s} \in \hat V_n(s)$, $(I - \hat H(s))x_{i_s} = 0$. So

(49) $(I - \hat H(s-1))x_{i_s} = (I - \hat H(s))x_{i_s} + (\hat H(s) - \hat H(s-1))x_{i_s} = (\hat H(s) - \hat H(s-1))x_{i_s}.$

Last, split $x_{i_s}$ into two terms, $x_{i_s} = x_{i_s}^{(1)} + x_{i_s}^{(2)}$, such that $x_{i_s}^{(1)} \in \hat V_n(s-1)$ and $x_{i_s}^{(2)} \in \hat V_n(s) \cap (\hat V_n(s-1))^{\perp}$. It follows that $(\hat H(s) - \hat H(s-1))x_{i_s}^{(1)} = 0$, and so

(50) $(\hat H(s) - \hat H(s-1))x_{i_s} = (\hat H(s) - \hat H(s-1))x_{i_s}^{(2)}.$

Combining (48)-(50) gives

(51) $(\hat H(s) - \hat H(s-1))X\beta = \beta_{i_s}(\hat H(s) - \hat H(s-1))x_{i_s}^{(2)}.$

Recalling that $Y = X\beta + z$, it follows that

(52) $(\hat H(s) - \hat H(s-1))Y = (\hat H(s) - \hat H(s-1))(\beta_{i_s} x_{i_s}^{(2)} + z).$

Now, take an orthonormal basis of $R^n$, say $q_1, q_2, \ldots, q_n$, such that $q_1 \in \hat V_n(s) \cap \hat V_n(s-1)^{\perp}$, $q_2, \ldots, q_s \in \hat V_n(s-1)$, and $q_{s+1}, \ldots, q_n \in \hat V_n(s)^{\perp}$. Recall that $x_{i_s}^{(2)}$ is contained in the one dimensional linear space $\hat V_n(s) \cap \hat V_n(s-1)^{\perp}$, so without loss of generality, assume $(x_{i_s}^{(2)}, q_1) = \|x_{i_s}^{(2)}\|$. Denote the square matrix $[q_1, \ldots, q_n]$ by $Q$. Let $\tilde z = Q^T z$ and let $\tilde z_1$ be the first coordinate of $\tilde z$. Note that marginally $\tilde z_1 \sim N(0, \sigma_n^2)$. Over the event $D_n$, it follows from the construction of $Q$ and basic algebra that

(53) $\|(\hat H(s) - \hat H(s-1))(\beta_{i_s} x_{i_s}^{(2)} + z)\|^2 = (\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1)^2.$

Combining (52) and (53),

$\|(\hat H(s) - \hat H(s-1))Y\|^2 = (\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1)^2$ over the event $D_n$.

As a result,

(54) $P(\|(\hat H(s) - \hat H(s-1))Y\| < t_n | D_n) = P((\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1)^2 < t_n^2 | D_n).$

Recall that conditional on the event $D_n$, $\hat S_n(s) = S$. So by the definition of $\Delta_n^* = \Delta_n^*(\beta, X, p)$,

$\|\beta_{i_s} x_{i_s}^{(2)}\| \ge \Delta_n^*,$

and

(55) $P((\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1)^2 < t_n^2 | D_n) \le P(\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1 < t_n | D_n) \le P(\Delta_n^* + \tilde z_1 < t_n | D_n).$

Recalling that $\tilde z_1 \sim N(0, \sigma_n^2)$ and that $P(D_n^c) = o(1)$,

(56) $P(\Delta_n^* + \tilde z_1 < t_n | D_n) \le P(\Delta_n^* + \tilde z_1 < t_n) + o(1).$

Note that by the assumption of $\frac{\Delta_n^* - t_n}{\sigma_n} \to \infty$, $P(\Delta_n^* + \tilde z_1 < t_n) = o(1)$. Combining this with (55)-(56) gives

(57) $P((\|\beta_{i_s} x_{i_s}^{(2)}\| + \tilde z_1)^2 < t_n^2 | D_n) = o(1).$

Inserting (57) into (54) gives (46). □



6.4. Proof of Lemma 3. For $1 \le i \le p$, introduce the random variable

$Z_i = \sum_{j \ne i} \beta_j (x_i, x_j).$

When $B_i = 0$, $\beta_i = 0$, and so $Z_i = \sum_{j=1}^{p} \beta_j (x_i, x_j)$. By the definition of $C_{NS}$,

$\max |C_{NS}\beta_S| = \max_{1 \le i \le p} \Big\{ (1 - B_i) \cdot \Big| \sum_{j=1}^{p} \beta_j (x_i, x_j) \Big| \Big\} = \max_{1 \le i \le p} \{(1 - B_i)|Z_i|\}.$

Also, recalling that the columns of matrix $X$ are normalized such that $(x_i, x_i) = 1$, the diagonal coordinates of $(C_{SS} - I)$ are 0. Therefore,

$\max |(C_{SS} - I)\beta_S| = \max_{1 \le i \le p} \Big\{ B_i \cdot \Big| \sum_{j \ne i} \beta_j (x_i, x_j) \Big| \Big\} = \max_{1 \le i \le p} \{B_i \cdot |Z_i|\}.$

Note that $Z_i$ and $B_i$ are independent and that $P(B_i = 0) = (1 - \epsilon)$. It follows that

$P(\max |C_{NS}\beta_S| \ge \delta) \le \sum_{i=1}^{p} P(B_i = 0) P(|Z_i| \ge \delta | B_i = 0) = (1 - \epsilon) \sum_{i=1}^{p} P(|Z_i| \ge \delta),$

and

$P(\max |(C_{SS} - I)\beta_S| \ge \delta) \le \sum_{i=1}^{p} P(B_i = 1) P(|Z_i| \ge \delta | B_i = 1) = \epsilon \sum_{i=1}^{p} P(|Z_i| \ge \delta).$

Compare these with the lemma. It is sufficient to show

(58) $P(|Z_i| \ge \delta) \le e^{-\delta t} [e^{\epsilon \bar{g}_i(t)} + e^{\epsilon \bar{g}_i(-t)}].$

Now, by the definition of $g_{ij}(t)$, the moment generating function of $Z_i$ satisfies

(59) $E[e^{t Z_i}] = E[e^{t \sum_{j \ne i} \beta_j (x_i, x_j)}] = \prod_{j \ne i} [1 + \epsilon g_{ij}(t)].$

Since $1 + x \le e^x$ for all $x$, $1 + \epsilon g_{ij}(t) \le e^{\epsilon g_{ij}(t)}$, so by the definition of $\bar{g}_i(t)$,

(60) $E[e^{t Z_i}] \le e^{\epsilon \sum_{j \ne i} g_{ij}(t)} = e^{\epsilon \bar{g}_i(t)}.$

It follows from Chebyshev's inequality that

(61) $P(Z_i \ge \delta) \le e^{-\delta t} E[e^{t Z_i}] \le e^{-\delta t} e^{\epsilon \bar{g}_i(t)}.$

Similarly,

(62) $P(Z_i < -\delta) \le e^{-\delta t} e^{\epsilon \bar{g}_i(-t)}.$

Inserting (61)-(62) into (58) gives the claim. □

6.5. Proof of Corollary 3.1. Choose a constant $q$ such that $q/2 - c_2 q > 1$ and let $t_n = q \log(p)/a_n$. By the definition of $A_n(a_n/2, \epsilon_n, \bar{g})$, it is sufficient to show that for all $1 \le i \le p$,

$e^{-a_n t_n/2} e^{\epsilon_n \bar{g}_i(t_n)} = o(1/p), \qquad e^{-a_n t_n/2} e^{\epsilon_n \bar{g}_i(-t_n)} = o(1/p).$

The proofs are similar, so we only show the first one. Let $u$ be a random variable such that $u \sim \pi_n$. Recall that the support of $|u|$ is contained in $[a_n, b_n]$. By the assumptions and the choice of $t_n$, for all fixed $i$ and $j \ne i$, $|t_n u (x_i, x_j)| \le q \log(p) (b_n/a_n) |(x_i, x_j)| \le c_1 q$. Since $e^x - 1 \le x + e^{|x|} x^2/2$, it follows from Taylor expansion that

$\epsilon_n \bar{g}_i(t_n) = \epsilon_n \sum_{j \ne i} E_{\pi_n}[e^{t_n u (x_i, x_j)} - 1] \le \epsilon_n \sum_{j \ne i} E_{\pi_n}\Big[ t_n u (x_i, x_j) + \frac{e^{c_1 q}}{2} t_n^2 u^2 (x_i, x_j)^2 \Big].$

By definitions of $m_n(X)$ and $v_n^2(X)$, $\epsilon_n \sum_{j \ne i} E_{\pi_n}[t_n u (x_i, x_j)] \le t_n \mu_n^{(1)} m_n(X)$, and $\epsilon_n \sum_{j \ne i} E_{\pi_n}[t_n^2 u^2 (x_i, x_j)^2] \le t_n^2 \mu_n^{(2)} v_n^2(X)$. It follows from (30) that

$\epsilon_n \bar{g}_i(t_n) \le q \log(p) \Big[ \frac{\mu_n^{(1)}}{a_n} m_n(X) + \frac{e^{c_1 q}}{2} \frac{\mu_n^{(2)}}{a_n^2} v_n^2(X) \, q \log(p) \Big] \le [c_2 q + o(1)] \log(p).$

Therefore,

$e^{-a_n t_n/2} e^{\epsilon_n \bar{g}_i(t_n)} \le e^{-[q/2 - c_2 q + o(1)] \log(p)},$

and the claim follows by the choice of $q$. □



6.6. Proof of Corollary 3.2. Choose a constant $q$ such that $2 < q < \frac{c_3}{c_4 \delta}$. Let $t_n = q \log(p)/a_n$, and $u$ be a random variable such that $u \sim \pi_n$. Similar to the proof of Corollary 3.1, we only show that

$e^{-a_n t_n/2} e^{\epsilon_n \bar{g}_i(t_n)} = o(1/p)$ for all $1 \le i \le p$.

Fix $i \ne j$. When $(x_i, x_j) = 0$, $e^{t_n u (x_i, x_j)} - 1 = 0$. When $(x_i, x_j) \ne 0$, by (32)-(33), $e^{t_n u (x_i, x_j)} - 1 \le e^{c_4 q \delta \log p}$. Also, $\epsilon_n N^* \le e^{-[c_3 + o(1)] \log(p)}$. Therefore,

$\epsilon_n \bar{g}_i(t_n) \le \epsilon_n N^* e^{c_4 q \delta \log(p)} \le e^{-[c_3 - c_4 q \delta + o(1)] \log p}.$

By the choice of $q$, $c_3 - c_4 q \delta > 0$, so $\epsilon_n \bar{g}_i(t_n) = o(1)$. It follows that

$e^{-a_n t_n/2} e^{\epsilon_n \bar{g}_i(t_n)} \le (1 + o(1)) \, e^{-a_n t_n/2} = (1 + o(1)) \, e^{-q \log(p)/2},$

which gives the claim by $q > 2$. □

6.7. Proof of Theorem 5. Write

$X = [x_1, \tilde X], \qquad \beta = (\beta_1, \tilde\beta)^T.$

Fix a constant $c_0 > 3$. Introduce the event

(63) $D_n(c_0) = \{ 1_S^T X^T X 1_S \le |S| [1 + \sqrt{|S|/n} \, (1 + \sqrt{2c_0 \log p})]^2, \text{ for all } S \}.$

The following lemma is proved in Section 6.7.1.

Lemma 4. Fix $c_0 > 3$. As $p \to \infty$,

$P(D_n^c(c_0)) = o(1/p^2).$

Since $d_n(\hat\beta | X) \le p$ for any variable selection procedure $\hat\beta$, Lemma 4 implies that the overall contribution of $D_n^c(c_0)$ to the Hamming distance $d_n^*(\hat\beta)$ is $o(1/p)$. In addition, write

By symmetry, it is sufficient to show that for any realization of (X, (3) G 
D n (co), 

{ (i9 + r) 2 
L(n)p" - , r>tf 
p , < r < 

where L(n) is a multi-log term that does not depend on (X, (3). 

We now show (64). Toward this end, we relate the estimation problem to the problem of testing the null hypothesis of $\beta_1 = 0$ versus the alternative hypothesis of $\beta_1 \ne 0$. Denote by $\phi$ the density of $N(0, 1)$. Recall that $X = [x_1, \tilde X]$ and $\beta = (\beta_1, \tilde\beta)^T$. The joint density associated with the null hypothesis is

$f_0(y) = f_0(y; \epsilon_n, \tau_n, n | X) = \int \phi(y - \tilde X\tilde\beta) d\tilde\beta = \phi(y) \int e^{y^T \tilde X\tilde\beta - |\tilde X\tilde\beta|^2/2} d\tilde\beta,$

and the joint density associated with the alternative hypothesis is

$f_1(y) = f_1(y; \epsilon_n, \tau_n, n | X) = \int \phi(y - \tau_n x_1 - \tilde X\tilde\beta) d\tilde\beta$

(65) $= \phi(y - \tau_n x_1) \int e^{y^T \tilde X\tilde\beta - |\tilde X\tilde\beta|^2/2} e^{-\tau_n x_1^T \tilde X\tilde\beta} d\tilde\beta.$

Since the prior probability that the null hypothesis is true is $(1 - \epsilon_n)$, the optimal test is the Neyman-Pearson test that rejects the null if and only if

$\frac{f_1(y)}{f_0(y)} \ge \frac{(1 - \epsilon_n)}{\epsilon_n}.$

The optimal testing error is equal to

$1 - \|(1 - \epsilon_n) f_0 - \epsilon_n f_1\|_1.$

Compared to (2), $\|\cdot\|_1$ stands for the $L^1$-distance between two functions, not the $\ell^1$ norm of a vector.

We need to modify $f_1$ into a more tractable form, but with negligible difference in $L^1$-distance. Toward this end, let $N_n(\tilde\beta)$ be the number of nonzero coordinates of $\tilde\beta$. Introduce the event

$B_n = \{ |N_n(\tilde\beta) - p\epsilon_n| \le \frac{1}{2} p\epsilon_n \}.$

Let 



(66) a n (y) = a n (y; e n , r n \X) 



f( e -yTXti-\XW/2) ■ l {B} df) 



Note that the only difference between the numerator and the denominator 
is the term e - T n x T x P which ~ 1 with high probability. Introduce 

(67) Uy) = a n (y)<l>(y - r nXl ) [ e^'^^dp. 



The following lemma is proved in Section 6.7.2.

Lemma 5. As $p \to \infty$, there is a generic constant $c > 0$ that does not depend on $y$ such that $|a_n(y) - 1| \le c \log(p) \, p^{1 - \vartheta - \theta/2}$ and $\|\tilde f_1 - f_1\|_1 = o(1/p)$.

We now ready to show the claim. Define O n = {y : a n (y)(f>(y — r n x\) > 
4>(y)}. Note that by the definitions of fo(y) and fi(y), y G ^ n if an d only if 

e «/i(y) > x 



(1 - e n )f (y) 
By Lemma [5] 

$\Big| \int \tilde f_1(y) dy - 1 \Big| \le \|\tilde f_1 - f_1\|_1 \le o(1/p).$

It follows from elementary calculus that



1 - ||(1 - e n )/ - e n fx\\t = / (1 - tn)My)dy + / e n h{y)dy + o(l/p). 
Using Lemma [H] again, we can replace f± by f\ on the right hand side, so 
1 - ||(1 - en)/o - ejxh = [ (1 - e n )fo(y)dy + [ e n h{y)dy + o(l/p). 



At the same time, let S p = c\og{p)p <yl ^ 6 / 2 be as in Lemma [5] and let 

t = t (#,r) = ^L^logp. 

be the unique solution of the equation <f)(t) = e n (j){t — r n ). It follows from 
Lemma [5] that, 

{T n x T y > t (l + 5 P )} C f2 n C {r n xf y > t (l - S p )}. 

As a result, 

/ fo(y)dy > f fo(y) = P (T n xjY > t (l + S p )), 

Jfl„ JT n xfy>t (l+Sp) 



and 



/ h(y)dy> f fi(y)^Pi(r n xjY <t (l-5 p )). 



iT n x\ y<t (l-Sp) 

Note that under the null, x±Y = xJX(3 + xjz. It is seen that given x\, 
xfz ~ N(0,\ Xl \ 2 ), and |xi| 2 = l + 0(l/y/n). Also, it is seen that except for 
a probability of o(l/p), xJX(3 is algebraically small. It follows that 

Po(r n xjY > t (l + 5 p )) < l>(t ) = L(n)p-^, 

where $ = 1 — $ is the survival function of N{0,1). Similarly, under the 
alternative, 

x[y = r n {x\,x\) + xJXfi + xjz, 
where (x\,xi) = 1 + 0(l/y/n). So 

{ (#+r) 2 
L(n)p- + , r>tf 
L(n)p , < r < 

Combining these gives the theorem. □

6.7.1. Proof of Lemma 4. It is seen that

$P(D_n^c(c_0)) \le \sum_{k=1}^{p} P\big( 1_S^T X^T X 1_S > k [1 + \sqrt{k/n} \, (1 + \sqrt{2c_0 \log p})]^2, \text{ for some } S \text{ with } |S| = k \big).$

Fix $k \ge 1$. There are $\binom{p}{k}$ different $S$ with $|S| = k$. It follows from [29, Lecture 9] that except for a probability of $2\exp(-c_0 \log(p) \cdot k)$, the largest eigenvalue of $X_S^T X_S$ is no greater than $[1 + \sqrt{k/n} \, (1 + \sqrt{2c_0 \log p})]^2$. So for any $S$ with $|S| = k$, it follows from basic algebra that

$P\big( 1_S^T X^T X 1_S > k [1 + \sqrt{k/n} \, (1 + \sqrt{2c_0 \log p})]^2 \big) \le 2\exp(-c_0 \log(p) \cdot k).$

Combining these with $\binom{p}{k} \le p^k$ gives

$P(D_n^c(c_0)) \le 2 \sum_{k=1}^{p} \binom{p}{k} \exp(-c_0 (\log p) k) \le 2 \sum_{k=1}^{p} \exp(-(c_0 - 1) \log(p) k).$

The claim follows by $c_0 > 3$. □

6.7.2. Proof of Lemma 5. First, we claim that for any $X$ in the event $D_n(c_0)$,

(68) $|x_1^T \tilde X\tilde\beta| \le c \log(p) (N_n(\tilde\beta)/\sqrt{n}),$

where $c > 0$ is a generic constant. Suppose $N_n(\tilde\beta) = k$ and the nonzero coordinates of $\tilde\beta$ are $i_1, \ldots, i_k$. Denote the $(k+1) \times (k+1)$ submatrix of $X^T X$ containing the 1st, $(1 + i_1)$-th, $\ldots$, and $(1 + i_k)$-th rows and columns by $U_{k+1}$. Let $\xi_1$ be the $(k+1)$-vector with 1 on the first coordinate and 0 elsewhere, and let $\xi_2$ be the $(k+1)$-vector with 0 on the first coordinate and 1 elsewhere. Then

$x_1^T \tilde X\tilde\beta = \tau_n \xi_1^T U_{k+1} \xi_2 = \tau_n \xi_1^T (U_{k+1} - I_{k+1}) \xi_2.$

Let $(U_{k+1} - I_{k+1}) = Q_{k+1} \Lambda_{k+1} Q_{k+1}^T$ be the orthogonal decomposition. By the definition of $D_n(c_0)$, all eigenvalues of $(U_{k+1} - I_{k+1})$ are no greater than $(1 + \sqrt{c \log(p) k/n})^2 - 1 \le \sqrt{c \log p} \sqrt{k/n}$ in absolute value. As a result, all diagonal coordinates of $\Lambda_{k+1}$ are no greater than

$\sqrt{c \log p} \sqrt{k/n}$

in absolute value, and

$\|\xi_1^T (U_{k+1} - I_{k+1}) \xi_2\| \le \|\xi_1^T Q_{k+1} \Lambda_{k+1}\| \cdot \|Q_{k+1}^T \xi_2\| \le \sqrt{c \log p} \sqrt{k/n} \, \|\xi_1^T Q_{k+1}\| \cdot \|Q_{k+1}^T \xi_2\|.$

The claim follows from $\|\xi_1^T Q_{k+1}\| = 1$ and $\|Q_{k+1}^T \xi_2\| = \sqrt{k}$.

We now show the lemma. Consider the first claim. Consider a realization of $X$ in the event $D_n(c_0)$ and a realization of $\tilde\beta$ in the event $B_n$. By the definition of $B_n$, $N_n(\tilde\beta) \le p\epsilon_n + \frac{1}{2} p\epsilon_n$. Recall that $p\epsilon_n = p^{1 - \vartheta}$ and $n = p^{\theta}$. It follows that $\log(p) N_n(\tilde\beta)/\sqrt{n} \le c \log(p) p\epsilon_n/\sqrt{n} = c \log(p) p^{1 - \vartheta - \theta/2}$. Note that by the assumption of $(1 - \vartheta) < \theta/2$, the exponent is negative. Combining this with (68),

(69) $|x_1^T \tilde X\tilde\beta| \le c \log(p) \, p^{1 - \vartheta - \theta/2}.$

Now, note that in the definition of $a_n(y)$ (i.e. (66)), the only difference between the integrand on the top and that on the bottom is the term $e^{-\tau_n x_1^T \tilde X\tilde\beta}$. Combining this with (69) gives the claim.



Consider the second claim. By the definitions of fi(y) and a n (y), 



h(y) = a n {y)4>{y - T n^\ 

= (j){y - T n xi) 



o y T X0-\Xf}\ 2 /2 



1b„W+ [e 



,y T X/3-\XP\ 2 /2 



yxP-\X P \V2 e -r nX lXP lBc]d ~ p + an[y) / [e y T Xf3-\Xtf/2 lBc]d p 



By the definition of fi{y), 



o y X /3-\X^/2 e -r n ^Xf3 lB j d ^ + f ^ X f3-\X 0f ^-r^XP^J^ 



h(y) = <t>(y-T n xi) 



Compare two equalities and recall that a n (y) ~ 1 (Lemma |4|, 



ll/i-/i||i< 
(70) 



4>{v - r nXl )[ / {e y T xMXf3\V2 + e y T xp-\x^/2 e -r nX lxp )lBcJ ^ ]dy 



<f>{y - r nXl - X(3)[e T " x ^ xp + l]l B cd0dy. 



Integrating over y, the last term is equal to J[l + e T n x iXf3^ . 
At the same time, by (68) and the definition of $B_n^c$,

(71) $\int [1 + e^{\tau_n x_1^T \tilde X\tilde\beta}] \cdot 1_{B_n^c} d\tilde\beta \le \sum_{\{k : |k - p\epsilon_n| > \frac{1}{2} p\epsilon_n\}} [1 + e^{c \log(p) k/\sqrt{n}}] P(N_n(\tilde\beta) = k).$

Recall that $p\epsilon_n = p^{1 - \vartheta}$, $n = p^{\theta}$, and $(1 - \vartheta) < \theta/2$. Using Bennett's inequality for $P(N_n(\tilde\beta) = k)$ (e.g. [31, Page 440]), it follows from elementary calculus that

(72) $\sum_{\{k : |k - p\epsilon_n| > \frac{1}{2} p\epsilon_n\}} [1 + e^{c \log(p) k/\sqrt{n}}] P(N_n(\tilde\beta) = k) = o(1/p).$

Combining (70)-(72) gives the claim. □